Python Office Automation - Basic use of the PyPDF2 library

PyPDF2, pdfplumber and PDFMiner are three Python libraries that handle PDF files well.

Today’s tutorial focuses on PyPDF2, which supports the following basic PDF operations:

1. Split a single PDF into multiple PDF files;
2. Merge multiple PDFs into one PDF file;
3. Rotate a page in the PDF;
4. Add a watermark to a PDF;
5. Encrypt a PDF;
6. Decrypt a PDF;
7. Obtain basic PDF information, such as author, title and page count.

PyPDF2 history

Before we start, a little bit about the history of PyPDF2. Its predecessor was the pyPdf package, first released in 2005; the last version of that package came out in 2010. About a year later, a company called Phaseit sponsored a fork of pyPdf, which was named PyPDF2. The two packages were essentially the same; the main difference was that PyPDF2 added Python 3 support.

At the time of writing, PyPDF2 had not been updated for a while either: its latest version dated from 2016. It remained popular all the same. PyPDF3, PyPDF4 and other packages appeared later, but they were not fully backward compatible with PyPDF2 and never became as popular with users.

PyPDF2 installation

As with other Python libraries, PyPDF2 can be installed with pip or conda:

pip install pypdf2
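One caveat before running the examples: the camelCase names used throughout this tutorial (PdfFileReader, getPage, addPage and so on) belong to the older PyPDF2 API. PyPDF2 2.x still accepts them with deprecation warnings, while PyPDF2 3.0 and later no longer do. The following is a minimal sketch, not part of the original tutorial, for checking which case applies on your machine; the helper names uses_legacy_api and installed_pypdf2_version are made up for illustration.

```python
from importlib import metadata


def uses_legacy_api(version):
    """Return True if this PyPDF2 version still accepts the camelCase names."""
    major = int(version.split(".")[0])
    return major < 3


def installed_pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is missing."""
    try:
        return metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pypdf2_version()
    if version is None:
        print("PyPDF2 is not installed")
    elif uses_legacy_api(version):
        print(f"PyPDF2 {version}: the examples should run as written")
    else:
        print(f"PyPDF2 {version}: legacy names are gone, consider 'PyPDF2<3'")
```

If the check reports version 3 or later, installing an older release (for example with pip install "PyPDF2<3") is one way to follow along unchanged.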

PDF information extraction

With PyPDF2 you can extract metadata and some text from a PDF to get a general idea of its contents.

The following data can be extracted with PyPDF2:

Author;
Creator;
Producer;
Subject;
Title;
Number of pages.

As test data I downloaded a sample PDF, Seige_of_Vicksburg_Sample_OCR, which is six pages long.

from PyPDF2 import PdfFileReader

pdf_path = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"

with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    # Read the metadata and the page count while the file is still open
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()

txt = f'''{pdf_path} information:
Author : {information.author},
Creator : {information.creator},
Producer : {information.producer},
Subject : {information.subject},
Title : {information.title},
Number of pages : {number_of_pages}
'''
print(txt)

The following is the printed result:

D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf information:
Author : DSI,
Creator : LuraDocument PDF Compressor Server 5.5.46.38,
Producer : LuraDocument PDF v2.38,
Subject : None,
Title : Binder1.pdf,
Number of pages : 6

In the example above, the PdfFileReader class is used to interact with the PDF file. Calling its getDocumentInfo() method returns a DocumentInformation instance that holds the metadata, and the getNumPages() method on the same reader object returns the number of pages in the document.

In my opinion, the only really valuable item here is the page count, which is very useful when you need statistics for a batch of files.
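To make the batch-counting idea concrete, here is a rough sketch that collects the page count of every PDF in a folder. It is not from the original tutorial: the function names page_count and count_pages_in_folder are invented for this example, and the code assumes the legacy PyPDF2 API (a release before 3.0).

```python
from pathlib import Path


def page_count(pdf_file):
    """Return the number of pages in one PDF (legacy PyPDF2 API)."""
    from PyPDF2 import PdfFileReader  # imported lazily; assumes PyPDF2 < 3.0

    with open(pdf_file, "rb") as f:
        return PdfFileReader(f).getNumPages()


def count_pages_in_folder(folder, counter=page_count):
    """Map the name of every *.pdf file in `folder` to its page count."""
    return {
        path.name: counter(path)
        for path in sorted(Path(folder).glob("*.pdf"))
    }
```

Calling count_pages_in_folder("D:/Data/office automation/PDF") would return a dictionary such as file name to page count; the counter argument only exists so that the counting function can be swapped out, for example in tests.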

PDF page rotation

In PyPDF2, each page of a PDF exists as a page object. To get a page instance, call the getPage(page_index) method of the reader object, where page_index is the zero-based index of the page.

There are two ways to rotate a page:

rotateClockwise(90): rotates the page 90 degrees clockwise;
rotateCounterClockwise(90): rotates the page 90 degrees counterclockwise.
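Both methods only accept angles that are multiples of 90 degrees, and a counterclockwise turn is just a clockwise turn by the complementary angle. The small helper below is an illustrative sketch of that relationship; clockwise_angle is a made-up name and not part of PyPDF2.

```python
def clockwise_angle(angle, counterclockwise=False):
    """Express a rotation as an equivalent clockwise angle in [0, 360)."""
    if angle % 90 != 0:
        raise ValueError("page rotation must be a multiple of 90 degrees")
    if counterclockwise:
        angle = -angle
    # Python's % always returns a non-negative result for a positive modulus
    return angle % 360
```

For example, clockwise_angle(90, counterclockwise=True) gives 270, so rotating 90 degrees to the left and 270 degrees to the right end up in the same orientation.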

The following code rotates the first page of the target PDF 90 degrees clockwise and the second page 90 degrees counterclockwise; the remaining pages are left unchanged:

from PyPDF2 import PdfFileReader, PdfFileWriter

# pdf_path is the sample file defined in the previous example
pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(pdf_path)

# Rotate page 1 by 90 degrees to the right
page_1 = pdf_reader.getPage(0).rotateClockwise(90)
pdf_writer.addPage(page_1)

# Rotate page 2 by 90 degrees to the left
page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
pdf_writer.addPage(page_2)

# Copy the remaining pages unchanged
for i in range(2, pdf_reader.getNumPages()):
    pdf_writer.addPage(pdf_reader.getPage(i))

# Note: this overwrites the input file; use another path to keep the original
with open(pdf_path, 'wb') as fh:
    pdf_writer.write(fh)


The code uses both the PdfFileReader and the PdfFileWriter class. Rotation does not modify the original PDF in place. Instead, a new PDF stream object is created in memory, each processed page is added to it with the addPage() method, and the in-memory object is finally written to a file.

To be honest, page rotation is rarely useful; I only included it to pad the word count, hahaha.

Split a single PDF into multiple PDFs

from PyPDF2 import PdfFileReader, PdfFileWriter

# Source PDF document
pdf_path = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
save_path = 'D:/Data/office automation/PDF/'

# Split the PDF page by page
pdf_reader = PdfFileReader(pdf_path)
for i in range(0, pdf_reader.getNumPages()):
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(i))
    # Write every page to its own file, named after the page index
    with open(save_path + '{}.pdf'.format(str(i)), 'wb') as fh:
        pdf_writer.write(fh)
    print('{} Save Successfully !\n'.format(str(i)))

The code writes each page of the PDF to its own PDF file, named after the page index.

Splitting can also be used to extract a fixed page range from a PDF. For example, to extract only pages 2-5 and nothing else, the code looks like this:

pdf_reader = PdfFileReader(pdf_path)
pdf_writer = PdfFileWriter()

# Pages 2-5 have the zero-based indices 1-4
for i in range(1, 5):
    pdf_writer.addPage(pdf_reader.getPage(i))

# Write the selected pages to a single file
with open(save_path + '2_5.pdf', 'wb') as fh:
    pdf_writer.write(fh)
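The shift between human page numbers (starting at 1) and page indices (starting at 0) in the range extraction above is easy to get wrong. As a small sketch, a helper like the following could do the conversion and reject impossible ranges; page_indices is a hypothetical name introduced only for this illustration.

```python
def page_indices(first_page, last_page, total_pages):
    """Convert an inclusive 1-based page range into 0-based indices."""
    if not 1 <= first_page <= last_page <= total_pages:
        raise ValueError(
            f"invalid page range {first_page}-{last_page} "
            f"for a document with {total_pages} pages"
        )
    return range(first_page - 1, last_page)
```

With it, the loop could be written as for i in page_indices(2, 5, pdf_reader.getNumPages()), which visits the indices 1, 2, 3 and 4.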

Multiple PDF files are merged into a single file

Splitting and merging work in opposite directions, but they use the same classes and follow the same principle.

PdfFileReader reads each PDF and its page objects are fetched one after another. PdfFileWriter creates a new stream object, the page objects held in memory are added to it in sequence, and the result is finally written to a file on disk.
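The ordering described above can be spelled out for any number of input files. This is only a sketch under the assumption of the legacy PyPDF2 API (before 3.0); merge_order and merge_pdfs are names invented for the example.

```python
def merge_order(page_counts):
    """List (file_index, page_index) pairs in the order they are written."""
    return [
        (file_index, page_index)
        for file_index, count in enumerate(page_counts)
        for page_index in range(count)
    ]


def merge_pdfs(input_paths, output_path):
    """Append all pages of each input PDF, in order, to one output file."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    readers = [PdfFileReader(path) for path in input_paths]
    writer = PdfFileWriter()
    counts = [reader.getNumPages() for reader in readers]
    for file_index, page_index in merge_order(counts):
        writer.addPage(readers[file_index].getPage(page_index))
    with open(output_path, "wb") as f:
        writer.write(f)
```

For two files with 2 pages and 1 page, merge_order([2, 1]) yields (0, 0), (0, 1), (1, 0): all pages of the first file come before any page of the second.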

from PyPDF2 import PdfFileReader, PdfFileWriter

p1_pdf = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
p2_pdf = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
merge_pdf = 'D:/Data/office automation/PDF/merge.pdf'

p1_reader = PdfFileReader(p1_pdf)
p2_reader = PdfFileReader(p2_pdf)
merge = PdfFileWriter()

# Write the pages of p1
for i in range(0, p1_reader.getNumPages()):
    merge.addPage(p1_reader.getPage(i))

# Write the pages of p2
for j in range(0, p2_reader.getNumPages()):
    merge.addPage(p2_reader.getPage(j))

# Write the merged document to disk
with open(merge_pdf, 'wb') as f:
    merge.write(f)


Add watermark to PDF

Of all the functions covered today, I think this one is the most useful. Batch watermarking mainly relies on the mergePage() method of the page object: merging two pages produces the watermark effect.

Since PyPDF2 can only manipulate PDF objects, the watermark must be saved as a PDF file before it can be added.

from PyPDF2 import PdfFileReader, PdfFileWriter

watermark = 'D:/Data/office automation/PDF/watermark.pdf'
input_pdf = '...'  # path of the PDF that receives the watermark
output = 'D:/Data/office automation/PDF/merge_water.pdf'

watermark_obj = PdfFileReader(watermark)
watermark_page = watermark_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages
for page_index in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_index)
    page.mergePage(watermark_page)
    pdf_writer.addPage(page)

with open(output, 'wb') as out:
    pdf_writer.write(out)

[Figure: from left to right, the original page, the watermark, and the original page after adding the watermark]

The result is not ideal: the page layout was not taken into account when the watermark was made, so part of the watermark is cut off after merging.

The advantage of adding watermarks in code is that you can choose which pages receive the watermark, for example only the odd pages or only the even pages. This is flexible and efficient, and the same code can of course process many files in a batch.
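Choosing the pages comes down to picking the right indices before calling mergePage(). A possible sketch is shown below; pages_to_watermark and its which argument are hypothetical and exist only for this illustration.

```python
def pages_to_watermark(total_pages, which="all"):
    """Return the 0-based indices of the pages that get the watermark.

    `which` counts pages the way a reader does: page 1 is odd, page 2 is even.
    """
    if which == "all":
        return list(range(total_pages))
    if which == "odd":
        return list(range(0, total_pages, 2))
    if which == "even":
        return list(range(1, total_pages, 2))
    raise ValueError(f"unknown selection: {which!r}")
```

In the watermark loop, one would then call page.mergePage(watermark_page) only when the current index is in the returned list, while still adding every page to the writer.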

PDF encryption and decryption

PDF encryption

If we do not want others to be able to read the contents of a PDF file, we can set a password for it with PyPDF2. For a single file it is more efficient to use an ordinary tool by hand, but for many files the following method is highly recommended:

from PyPDF2 import PdfFileReader, PdfFileWriter

watermark = 'D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf'
input_pdf = 'D:/Data/office automation/PDF/merge.pdf'
output = '...'  # path of the encrypted output PDF

watermark_obj = PdfFileReader(watermark)
watermark_page = watermark_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages
for page_index in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_index)
    page.mergePage(watermark_page)
    pdf_writer.addPage(page)

# Protect the output with a user password
pdf_writer.encrypt(user_pwd='123456', use_128bit=True)

with open(output, 'wb') as out:
    pdf_writer.write(out)

The encrypt() method is used here; three of its parameters deserve attention:

user_pwd (str): the user password, required to open and read the file;
owner_pwd (str): the owner password, one level above the user password, which opens the file without any restrictions. If it is not specified, owner_pwd defaults to the same value as user_pwd;
use_128bit (bool): whether to use 128-bit encryption. False means 40-bit encryption; the default is True.
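Since encryption pays off mainly for many files, the parameters above can be wrapped in a small function and applied in a loop. This is a sketch rather than a finished tool: encrypted_name and encrypt_pdf are invented names, the "_encrypted" suffix is an arbitrary choice, and the legacy PyPDF2 API (before 3.0) is assumed.

```python
from pathlib import Path


def encrypted_name(pdf_file, suffix="_encrypted"):
    """Build the output path, e.g. report.pdf -> report_encrypted.pdf."""
    path = Path(pdf_file)
    return path.with_name(path.stem + suffix + path.suffix)


def encrypt_pdf(pdf_file, user_pwd, owner_pwd=None):
    """Write a password-protected copy of `pdf_file` next to the original."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    reader = PdfFileReader(str(pdf_file))
    writer = PdfFileWriter()
    for i in range(reader.getNumPages()):
        writer.addPage(reader.getPage(i))
    # owner_pwd=None lets the library fall back to the user password
    writer.encrypt(user_pwd=user_pwd, owner_pwd=owner_pwd, use_128bit=True)
    output = encrypted_name(pdf_file)
    with open(output, "wb") as f:
        writer.write(f)
    return output
```

A batch run could then loop over Path(folder).glob("*.pdf") and call encrypt_pdf(path, "123456") for each file; writing to a new name keeps the unencrypted originals untouched.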

PDF decryption

Decryption happens when a file is read, using the decrypt() method of the reader:

from PyPDF2 import PdfFileWriter, PdfFileReader

input_pdf = 'reportlab-encrypted.pdf'
output_pdf = 'reportlab.pdf'
password = 'twofish'

pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(input_pdf)
# decrypt() unlocks the reader in place and returns a status code,
# so its result must not be assigned back to pdf_reader
pdf_reader.decrypt(password)

for page in range(pdf_reader.getNumPages()):
    pdf_writer.addPage(pdf_reader.getPage(page))

with open(output_pdf, 'wb') as fh:
    pdf_writer.write(fh)

In the example above, decryption works by reading the encrypted file, unlocking it with the password, and writing its pages to a new, unencrypted PDF.
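The decrypt() method reports whether the password was accepted through an integer: 0 means the password failed, 1 means it matched the user password and 2 means it matched the owner password. The sketch below shows one way to act on that value; describe_decrypt_result and open_decrypted are hypothetical helpers, and the legacy PyPDF2 API (before 3.0) is assumed.

```python
def describe_decrypt_result(code):
    """Translate the integer returned by decrypt() into a short message."""
    messages = {
        0: "wrong password",
        1: "matched the user password",
        2: "matched the owner password",
    }
    return messages.get(int(code), "unexpected return value")


def open_decrypted(pdf_file, password):
    """Return a reader for `pdf_file`, unlocking it first if necessary."""
    from PyPDF2 import PdfFileReader  # assumes PyPDF2 < 3.0

    reader = PdfFileReader(pdf_file)
    if reader.isEncrypted:
        result = reader.decrypt(password)
        if result == 0:
            raise ValueError(describe_decrypt_result(result))
    return reader
```

Checking the return value this way avoids the confusing errors that otherwise appear later, when pages are read from a document that is still locked.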

Summary

This article introduced the basic usage of the PyPDF2 library and used code examples to carry out some basic operations. A word of caution: everything above is only worthwhile for batch operations. If you are dealing with a single file, use an ordinary tool; showing off with code is a waste of time.

pdfplumber and PDFMiner are much better at text extraction. As the saying goes, to do a good job you must first sharpen your tools. I’ll cover them in the next tutorial, and I look forward to your attention!

Well, that’s all for this article. Thank you for reading, and we’ll see you next time.
