PyPDF2, pdfplumber and PDFMiner are three Python libraries that handle PDF files well.
Today’s tutorial focuses on PyPDF2 and uses it to implement the following basic PDF operations:
- 1. Split a single PDF into multiple PDF files;
- 2. Merge multiple PDFs into one PDF file;
- 3. Rotate a page in a PDF;
- 4. Add a watermark to a PDF;
- 5. Encrypt a PDF;
- 6. Decrypt a PDF;
- 7. Obtain basic PDF information, such as author, title, number of pages, etc.
PyPDF2 history
Before we start, a little bit about the history of PyPDF2. Its predecessor was the pyPdf package, first released in 2005; the last version of that package was released in 2010. About a year later, a company called Phaseit sponsored a fork of pyPdf, which was named PyPDF2. The two packages were essentially the same; the main difference was that PyPDF2 added Python 3 support.
When this tutorial was written, PyPDF2 itself had not been updated for some time: the latest version dated from 2016. Its popularity held up nonetheless. PyPDF3, PyPDF4 and other packages were introduced later, but they were not fully backward compatible with PyPDF2 and never became as popular with users.
PyPDF2 installation
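As with other Python libraries, installation can be done with pip or conda. The examples below use the PyPDF2 1.x class and method names (PdfFileReader, getPage and so on); PyPDF2 3.0 removed these names, so pin an older release if you want to run the code unchanged.
pip install "pypdf2<3"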
PDF information extraction
With PyPDF2 you can extract some metadata and text from a PDF to get a general idea of its contents.
The metadata that can be extracted with PyPDF2 is as follows:
- Author;
- Creator;
- Producer;
- Subject;
- Title;
- Number of pages.
As test data I downloaded a sample PDF, Seige_of_Vicksburg_Sample_OCR.pdf, which is six pages long.
from PyPDF2 import PdfFileReader

pdf_path = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"

# Read the metadata and page count while the file is open
with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()

txt = f'''{pdf_path} information:
Author : {information.author},
Creator : {information.creator},
Producer : {information.producer},
Subject : {information.subject},
Title : {information.title},
Number of pages : {number_of_pages}
'''
print(txt)
The following is the printed result
D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf information:
Author : DSI,
Creator : LuraDocument PDF Compressor Server 5.5.46.38,
Producer : LuraDocument PDF v2.38,
Subject : None,
Title : Binder1.pdf,
Number of pages : 6
In the example above, the PdfFileReader class is used to interact with the PDF file. Calling its getDocumentInfo() method returns a DocumentInformation instance that holds the metadata, and the reader's getNumPages() method returns the number of pages in the document.
In my opinion, the only really valuable item here is the page count, which is very useful when you need statistics over a batch of files.
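To make the batch-counting idea concrete, here is a minimal sketch. The function names `pdf_page_count` and `count_pages` are hypothetical helpers, not part of PyPDF2, and the code assumes the PyPDF2 1.x-style names used in this article. The PyPDF2 import is done lazily inside the function so the counting logic can be tried with any other counter.

```python
from typing import Callable, Dict, Iterable


def pdf_page_count(path: str) -> int:
    """Return the page count of one PDF (assumes the PyPDF2 1.x-style API)."""
    from PyPDF2 import PdfFileReader  # lazy import: only needed for real PDFs

    with open(path, "rb") as f:
        return PdfFileReader(f).getNumPages()


def count_pages(paths: Iterable[str],
                counter: Callable[[str], int] = pdf_page_count) -> Dict[str, int]:
    """Map each path to its page count; `counter` can be swapped out for testing."""
    return {path: counter(path) for path in paths}
```

Calling `count_pages` on a list of paths and summing the values of the returned dictionary gives the total number of pages in the batch.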
PDF page rotation
In PyPDF2, each page of a PDF exists as a page object. To get a page instance, call the getPage(page_index) method of the reader object, where page_index is the zero-based index of the page.
There are two methods for rotating a page:
- rotateClockwise(90): rotate 90 degrees clockwise;
- rotateCounterClockwise(90): rotate 90 degrees counterclockwise.
The following code rotates the first page of the target PDF 90 degrees clockwise and the second page 90 degrees counterclockwise, and leaves the orientation of the other pages unchanged:
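from PyPDF2 import PdfFileReader, PdfFileWriter

# pdf_path is the sample PDF defined in the previous example
pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(pdf_path)

# Rotate page 1 by 90 degrees to the right
page_1 = pdf_reader.getPage(0).rotateClockwise(90)
pdf_writer.addPage(page_1)

# Rotate page 2 by 90 degrees to the left
page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
pdf_writer.addPage(page_2)

# Copy the remaining pages unchanged
for i in range(2, pdf_reader.getNumPages()):
    pdf_writer.addPage(pdf_reader.getPage(i))

# Write to a new file: opening the input path with 'wb' would truncate
# the file while the reader still needs it
output_path = pdf_path.replace('.pdf', '_rotated.pdf')
with open(output_path, 'wb') as fh:
    pdf_writer.write(fh)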
The results are as follows
The code uses both the PdfFileReader and PdfFileWriter classes. Rotation does not modify the original PDF in place. Instead, a new PDF object is created in memory, each processed page is added to it with the addPage() method, and the in-memory object is then written to a file.
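The same read, rotate, add and write flow can be wrapped in a reusable function. This is only a sketch: `normalize_rotation` and `rotate_pages` are hypothetical names, the PyPDF2 calls are the 1.x-style ones used in this article, and the check assumes that rotation angles must be multiples of 90 degrees.

```python
from typing import Dict


def normalize_rotation(angle: int) -> int:
    """Reduce a signed angle to 0, 90, 180 or 270 degrees clockwise."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def rotate_pages(src: str, dst: str, rotations: Dict[int, int]) -> None:
    """Copy src to dst, rotating the pages listed in `rotations`.

    `rotations` maps a zero-based page index to degrees clockwise
    (negative values mean counterclockwise).
    """
    from PyPDF2 import PdfFileReader, PdfFileWriter  # lazy import, 1.x-style names

    with open(src, "rb") as fin:
        reader = PdfFileReader(fin)
        writer = PdfFileWriter()
        for i in range(reader.getNumPages()):
            page = reader.getPage(i)
            angle = normalize_rotation(rotations.get(i, 0))
            if angle:
                page.rotateClockwise(angle)
            writer.addPage(page)
        # Write while the source file is still open
        with open(dst, "wb") as fout:
            writer.write(fout)
```

With this helper, the example above would correspond to `rotate_pages(src, dst, {0: 90, 1: -90})`, where `dst` should be a different path from `src`.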
To be honest, page rotation is of little practical use; it is mostly here to pad out the word count, hahaha.
Split a single PDF into multiple PDFs
from PyPDF2 import PdfFileReader, PdfFileWriter

# PDF document to split
pdf_path = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
save_path = 'D:/Data/office automation/PDF/'

# Split the PDF page by page
pdf_reader = PdfFileReader(pdf_path)
for i in range(0, pdf_reader.getNumPages()):
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(i))
    # Write every page to its own file, named after the page index
    with open(save_path + '{}.pdf'.format(str(i)), 'wb') as fh:
        pdf_writer.write(fh)
    print('{} Save Successfully !\n'.format(str(i)))
The code writes each page of the PDF to its own PDF file, named after the page index.
Splitting can also be used to extract a fixed page range from a PDF. For example, to extract only pages 2-5 and nothing else, the code looks like this:
pdf_reader = PdfFileReader(pdf_path)
pdf_writer = PdfFileWriter()

# Pages 2-5 correspond to the zero-based indices 1-4
for i in range(1, 5):
    pdf_writer.addPage(pdf_reader.getPage(i))

# Write the selected pages to a single file
with open(save_path + '2_5.pdf', 'wb') as fh:
    pdf_writer.write(fh)
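The off-by-one between page numbers (starting at 1) and page indices (starting at 0) is easy to get wrong, so it can help to isolate the conversion. A small sketch, where `page_indices` and `extract_range` are hypothetical helpers and the PyPDF2 calls are the 1.x-style ones used in this article:

```python
from typing import List


def page_indices(first: int, last: int, total: int) -> List[int]:
    """Convert a 1-based inclusive page range to zero-based page indices."""
    if not 1 <= first <= last <= total:
        raise ValueError(f"invalid range {first}-{last} for {total} pages")
    return list(range(first - 1, last))


def extract_range(src: str, dst: str, first: int, last: int) -> None:
    """Write pages first..last (1-based, inclusive) of src to dst."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # lazy import, 1.x-style names

    with open(src, "rb") as fin:
        reader = PdfFileReader(fin)
        writer = PdfFileWriter()
        for i in page_indices(first, last, reader.getNumPages()):
            writer.addPage(reader.getPage(i))
        with open(dst, "wb") as fout:
            writer.write(fout)
```

For the six-page sample, `page_indices(2, 5, 6)` yields the indices 1 to 4, matching the `range(1, 5)` loop above.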
Merge multiple PDF files into a single file
Splitting and merging work in opposite directions, but they use the same classes and the same principle.
PdfFileReader reads each PDF and its page objects are retrieved one by one. PdfFileWriter creates a new stream object, the page objects are added to it in order, and the result is finally written to a file on disk.
from PyPDF2 import PdfFileReader, PdfFileWriter

# The same sample file is used twice as input
p1_pdf = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
p2_pdf = "D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
merge_pdf = 'D:/Data/office automation/PDF/merge.pdf'

p1_reader = PdfFileReader(p1_pdf)
p2_reader = PdfFileReader(p2_pdf)
merge = PdfFileWriter()

# Write p1
for i in range(0, p1_reader.getNumPages()):
    merge.addPage(p1_reader.getPage(i))

# Write p2
for j in range(0, p2_reader.getNumPages()):
    merge.addPage(p2_reader.getPage(j))

# Write out
with open(merge_pdf, 'wb') as f:
    merge.write(f)
The result is as follows
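The same read-and-append pattern extends to any number of input files. The following is a sketch under a few assumptions: `check_merge_args` and `merge_pdfs` are hypothetical helpers, the PyPDF2 calls are the 1.x-style ones used in this article, and the input files are kept open until the merged file has been written because page content may only be read from the source at write time.

```python
import os
from typing import Sequence


def check_merge_args(inputs: Sequence[str], output: str) -> None:
    """Reject an empty input list or an output path that would overwrite an input."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    out = os.path.abspath(output)
    if any(os.path.abspath(p) == out for p in inputs):
        raise ValueError("output path must differ from every input path")


def merge_pdfs(inputs: Sequence[str], output: str) -> None:
    """Append all pages of each input, in order, and write them to `output`."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # lazy import, 1.x-style names

    check_merge_args(inputs, output)
    writer = PdfFileWriter()
    handles = []
    try:
        for path in inputs:
            fh = open(path, "rb")
            handles.append(fh)  # keep open until the merged file is written
            reader = PdfFileReader(fh)
            for i in range(reader.getNumPages()):
                writer.addPage(reader.getPage(i))
        with open(output, "wb") as out:
            writer.write(out)
    finally:
        for fh in handles:
            fh.close()
```

The argument check guards against the easy mistake of naming one of the inputs as the output, which would truncate that input before its pages are copied.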
Add watermark to PDF
Of all the functions covered today, I think this one is the most useful. Batch watermarking mainly relies on the mergePage() method of the page object: merging two pages produces the effect of adding a watermark.
Since PyPDF2 can only manipulate PDF objects, the watermark has to be saved as a PDF file before it can be added.
from PyPDF2 import PdfFileReader, PdfFileWriter

watermark = 'D:/Data/office automation/PDF/watermark.pdf'
input_pdf = 'input.pdf'  # placeholder path: the PDF to watermark
output = 'D:/Data/office automation/PDF/merge_water.pdf'

# The watermark is the first page of the watermark PDF
watermark_obj = PdfFileReader(watermark)
watermark_page = watermark_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    page.mergePage(watermark_page)
    pdf_writer.addPage(page)

with open(output, 'wb') as out:
    pdf_writer.write(out)
The effect is as follows; from left to right: the original page, the watermark, and the page after the watermark has been added.
The result above is not ideal: the page layout was not taken into account when the watermark was made, so part of the watermark is lost in the merge.
The advantage of adding watermarks in code is that you can choose which pages receive the watermark, for example only the odd pages or only the even pages. This is both flexible and efficient, and of course you can also process multiple files in a batch.
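Selecting only odd or only even pages could look like the following sketch. `select_pages` and `add_watermark` are hypothetical helpers, "odd" and "even" refer to page numbers starting at 1, and the PyPDF2 calls are the 1.x-style ones used in this article.

```python
from typing import List


def select_pages(total: int, mode: str = "all") -> List[int]:
    """Return the zero-based indices of the pages to watermark.

    'odd' and 'even' refer to 1-based page numbers, so 'odd' selects
    indices 0, 2, 4, ... and 'even' selects indices 1, 3, 5, ...
    """
    if mode == "all":
        return list(range(total))
    if mode == "odd":
        return list(range(0, total, 2))
    if mode == "even":
        return list(range(1, total, 2))
    raise ValueError("mode must be 'all', 'odd' or 'even'")


def add_watermark(src: str, watermark: str, dst: str, mode: str = "all") -> None:
    """Merge the first page of `watermark` onto the selected pages of `src`."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # lazy import, 1.x-style names

    with open(src, "rb") as fin, open(watermark, "rb") as fwm:
        reader = PdfFileReader(fin)
        mark = PdfFileReader(fwm).getPage(0)
        targets = set(select_pages(reader.getNumPages(), mode))
        writer = PdfFileWriter()
        for i in range(reader.getNumPages()):
            page = reader.getPage(i)
            if i in targets:
                page.mergePage(mark)
            writer.addPage(page)  # unselected pages are copied unchanged
        with open(dst, "wb") as fout:
            writer.write(fout)
```

Looping `add_watermark` over a list of source files gives the batch operation mentioned above.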
PDF encryption and decryption
PDF encryption
If we do not want others to be able to read the contents of a PDF file, we can set a password on it with PyPDF2. For a single file it is more efficient to use a tool and do it by hand, but for multiple files the following method is strongly recommended.
from PyPDF2 import PdfFileReader, PdfFileWriter

watermark = 'D:/Data/office automation/PDF/Seige_of_Vicksburg_Sample_OCR.pdf'
input_pdf = 'D:/Data/office automation/PDF/merge.pdf'
output = 'encrypted.pdf'  # placeholder path: where the encrypted PDF is written

watermark_obj = PdfFileReader(watermark)
watermark_page = watermark_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    page.mergePage(watermark_page)
    pdf_writer.addPage(page)

# Set the password before writing the file
pdf_writer.encrypt(user_pwd='123456', use_128bit=True)
with open(output, 'wb') as out:
    pdf_writer.write(out)
The encrypt() function is used here, and three of its parameters deserve attention:
- user_pwd: str, the user password, required to open and read the file;
- owner_pwd: str, the owner password, one level above the user password; it opens the file without any restrictions. If it is not specified, owner_pwd defaults to the same value as user_pwd;
- use_128bit: whether to use 128-bit encryption. If the value is False, 40-bit encryption is used. The default value is True.
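Put together, the three parameters could be used in a small helper for encrypting files in bulk. This is a sketch: `encrypted_name` and `encrypt_pdf` are hypothetical names, the "_encrypted" suffix is an arbitrary choice, and the PyPDF2 calls are the 1.x-style ones used in this article.

```python
import os
from typing import Optional


def encrypted_name(path: str, suffix: str = "_encrypted") -> str:
    """Build the output path by inserting `suffix` before the file extension."""
    root, ext = os.path.splitext(path)
    return f"{root}{suffix}{ext}"


def encrypt_pdf(src: str, user_pwd: str,
                owner_pwd: Optional[str] = None,
                use_128bit: bool = True) -> str:
    """Write an encrypted copy of `src` next to it and return the new path."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # lazy import, 1.x-style names

    dst = encrypted_name(src)
    with open(src, "rb") as fin:
        reader = PdfFileReader(fin)
        writer = PdfFileWriter()
        for i in range(reader.getNumPages()):
            writer.addPage(reader.getPage(i))
        # With owner_pwd=None the owner password falls back to user_pwd
        writer.encrypt(user_pwd=user_pwd, owner_pwd=owner_pwd,
                       use_128bit=use_128bit)
        with open(dst, "wb") as fout:
            writer.write(fout)
    return dst
```

Calling `encrypt_pdf` in a loop over a list of paths covers the multi-file case; writing to a separate output path leaves the originals untouched.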
PDF decryption
Decryption happens when the file is read, using the reader's decrypt() function.
from PyPDF2 import PdfFileWriter, PdfFileReader

input_pdf = 'reportlab-encrypted.pdf'
output_pdf = 'reportlab.pdf'
password = 'twofish'

pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(input_pdf)
# decrypt() unlocks the reader in place and returns a status code,
# so its result must not be assigned back to pdf_reader
pdf_reader.decrypt(password)

for page in range(pdf_reader.getNumPages()):
    pdf_writer.addPage(pdf_reader.getPage(page))

with open(output_pdf, 'wb') as fh:
    pdf_writer.write(fh)
In the example above, decryption works by reading an encrypted file and writing its pages to a new, unencrypted PDF.
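It is worth checking whether the password was actually accepted before copying pages. In the 1.x-style API used in this article, decrypt() returns 0 if the password failed, 1 if it matched the user password and 2 if it matched the owner password. A sketch, with `describe_decrypt_result` and `decrypt_pdf` as hypothetical helper names:

```python
DECRYPT_RESULTS = {
    0: "wrong password",
    1: "matched the user password",
    2: "matched the owner password",
}


def describe_decrypt_result(code: int) -> str:
    """Translate the integer returned by decrypt() into a readable message."""
    return DECRYPT_RESULTS.get(code, f"unexpected result {code}")


def decrypt_pdf(src: str, dst: str, password: str) -> None:
    """Write an unencrypted copy of `src` to `dst`."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # lazy import, 1.x-style names

    with open(src, "rb") as fin:
        reader = PdfFileReader(fin)
        if reader.isEncrypted:
            code = reader.decrypt(password)
            if code == 0:
                raise ValueError(describe_decrypt_result(code))
        writer = PdfFileWriter()
        for i in range(reader.getNumPages()):
            writer.addPage(reader.getPage(i))
        with open(dst, "wb") as fout:
            writer.write(fout)
```

Raising an error on a failed password makes a batch job stop at the offending file instead of failing later with a less obvious message.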
Summary
This article introduced the basic usage of the PyPDF2 library and used it, with code examples, to implement some basic PDF operations. A word of caution: everything above is only worthwhile for batch operations. For a single file, use an ordinary tool; being too showy is a waste of time.
pdfplumber and PDFMiner are much better at text extraction. As the saying goes, to do a good job you must first sharpen your tools. I’ll cover them in the next tutorial, and I look forward to your attention!
Well, that’s all for this article. Thank you for reading, and we’ll see you next time.