A friend needed to split a PDF file and, after searching online, found that PyPDF2 can do this, so I studied the library and took some notes. PyPDF2 targets Python 3; the corresponding library for Python 2 was pyPdf.
You can install it directly with pip:
pip install pypdf2
Official document: pythonhosted.org/PyPDF2/
The library mainly provides the following classes:
PdfFileReader.
This class mainly provides read operations on PDF files. Its constructor is as follows:
PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
The first argument can be a file stream or a file path. The last three parameters control how warnings are handled; the default values are usually fine.
Once you have an instance, you can work with the PDF. The main operations are as follows:
- decrypt(password): decrypts the PDF file if it is encrypted.
- getDocumentInfo(): retrieves information about the PDF file. The return value is a DocumentInformation instance. Printing it directly yields output similar to the following:
{'/ModDate': "D:20150310202949-07'00'", '/Title': '', '/Creator': 'LaTeX with hyperref package', '/CreationDate': "D:20150310202949-07'00'", '/PTEX.Fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014/MacPorts 2014_6) kpathsea version 6.2.0', '/Producer': 'pdfTeX-1.40.15', '/Keywords': '', '/Trapped': '/False', '/Author': '', '/Subject': ''}
- getNumPages(): returns the number of pages in the PDF file.
- getPage(pageNumber): returns the page at index pageNumber (counting from 0) as a PageObject instance. Once you have the PageObject, you can add it to or insert it into a Writer.
- getPageNumber(page): the reverse of the method above; pass in a PageObject instance to get its page number in the PDF file.
- getOutlines(node=None, outlines=None): retrieves the outlines (bookmarks) that appear in the document.
- isEncrypted: records whether the PDF is encrypted. If the file itself is encrypted, this remains True even after calling decrypt.
- numPages: the total number of pages in the PDF; a read-only property equivalent to calling getNumPages().
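The /ModDate and /CreationDate values in the output above are PDF date strings of the form D:YYYYMMDDHHmmSS followed by an optional offset such as -07'00'. As a minimal sketch, here is one way to turn such a string into a Python datetime using only the standard library. The helper name parse_pdf_date is hypothetical (it is not part of PyPDF2), and it only handles the full form shown above, not shortened dates or a trailing Z.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches D:YYYYMMDDHHmmSS with an optional +HH'mm' or -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError('unsupported PDF date: %r' % (value,))
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, off_hours, off_minutes = groups[6:]
    moment = datetime(year, month, day, hour, minute, second)
    if sign is None:
        # No offset in the string: return a naive datetime.
        return moment
    offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
    if sign == '-':
        offset = -offset
    return moment.replace(tzinfo=timezone(offset))


print(parse_pdf_date("D:20150310202949-07'00'").isoformat())
# prints 2015-03-10T20:29:49-07:00
```

In practice you would pass in the string stored under '/ModDate' or '/CreationDate' in the dictionary returned by getDocumentInfo().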
PdfFileWriter.
This class supports writing PDF files. Typically you read PDF data with PdfFileReader and then use this class to operate on it and write the result.
No parameters are required to create an instance of this class.
The main methods are as follows:
- addAttachment(fname, fdata): embeds a file in the PDF as an attachment.
- addBlankPage(width=None, height=None): appends a blank page to the end of the PDF; if no size is specified, the size of the last page in the current Writer is used.
- addPage(page): adds a page to the PDF, usually one obtained from the Reader above.
- appendPagesFromReader(reader, after_page_append=None): copies the pages from reader into the current Writer instance. If after_page_append is specified, it is a callback function invoked after each page is appended, and it receives the page as added to the Writer.
- encrypt(user_pwd, owner_pwd=None, use_128bit=True): user_pwd is the user password, which opens the PDF with limited permissions. The permissions can supposedly be restricted, but I could not find how to set them in the documentation. owner_pwd is the owner password, which allows unrestricted use. The third parameter selects whether 128-bit encryption is used.
- getNumPages(): returns the number of pages in the PDF.
- getPage(pageNumber): returns the corresponding page as a PageObject, which you can add with the addPage method above.
- insertPage(page, index=0): inserts the page into the PDF; index specifies the position at which the page is inserted.
- write(stream): writes the content of the Writer to a file stream.
PdfFileMerger.
This class is used to merge PDF files. Its constructor takes one parameter: PdfFileMerger(strict=True). This parameter is discussed later.
Common methods:
- addBookmark(title, pagenum, parent=None): adds a bookmark to the PDF. title is the title of the bookmark and pagenum is the page that the bookmark points to.
- append(fileobj, bookmark=None, pages=None, import_bookmarks=True): appends the pages of fileobj to the end of the output. pages can be a (start, stop[, step]) tuple or a PageRange, so that only the specified range of pages from fileobj is added.
- merge(position, fileobj, bookmark=None, pages=None, import_bookmarks=True): similar to append, but the position argument specifies where the pages are inserted.
- write(fileobj): writes the data to a file.
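As a sketch of how a (start, stop[, step]) tuple selects pages: my understanding is that the tuple is interpreted like the arguments to Python's built-in range(), so indices are zero-based and stop is exclusive. The helper selected_pages below is hypothetical and does not call PyPDF2; it only illustrates the indexing under that assumption.

```python
def selected_pages(pages, total_pages):
    """Return the zero-based page indices a (start, stop[, step]) tuple selects.

    Hypothetical helper: assumes the tuple is expanded like range(*pages).
    """
    indices = list(range(*pages))
    for index in indices:
        if not 0 <= index < total_pages:
            raise IndexError('page index %d is out of range' % index)
    return indices


print(selected_pages((0, 3), 10))      # the first three pages
print(selected_pages((0, 10, 2), 10))  # every other page of a 10-page file
```

So, under this assumption, pages=(0, 3) would add the first three pages of fileobj, and pages=(0, 10, 2) would add pages 0, 2, 4, 6 and 8.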
To use it, create a PdfFileMerger instance, call append or merge for each PDF file you want to merge, in order, and save the result with write.
from PyPDF2 import PdfFileMerger


def merge_pdf():
    # Create an instance to merge files
    pdf_merger = PdfFileMerger()
    # Add the Week1_1.pdf file first
    pdf_merger.append('Week1_1.pdf')
    # Then insert the ex1.pdf file at position 0
    pdf_merger.merge(0, 'ex1.pdf')
    # Add a bookmark pointing at page 1
    pdf_merger.addBookmark('This is a bookmark', 1)
    # Write the result to a file
    pdf_merger.write('merge_pdf.pdf')
Now let's look at the strict parameter in PdfFileMerger(strict=True).
The official explanation for this parameter:
strict (bool) - Determines whether user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to True.
That is, it decides whether the user is warned about all problems, and it also makes some otherwise correctable problems fatal.
At first glance this parameter only seems to control warnings, so the default looks fine. But when I tried to merge PDFs containing Chinese text, I got the following error:
Traceback (most recent call last):
  File "I:\python3.5\lib\site-packages\PyPDF2\generic.py", line 484, in readFromStream
    return NameObject(name.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 10: invalid continuation byte

During handling of the above exception, another exception occurred:

PyPDF2.utils.PdfReadError: Illegal character in Name Object
The error comes from the UTF-8 decoding in the library source. I tried modifying the source to decode with GBK instead, but that produced other errors. When strict is set to False in the constructor, the console prints the following warning instead:
PdfReadWarning: Illegal character in Name Object [generic.py:489]
The two files were nevertheless merged successfully. The result is inconsistent, though: running the same code several times, the Chinese text is sometimes handled correctly and sometimes comes out garbled.
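One plausible reading of the UnicodeDecodeError above (an assumption, not something verified here) is that the name bytes in the PDF were written in a legacy Chinese encoding such as GBK rather than UTF-8. The sketch below uses only the standard library to show that kind of failure and a fallback; first_working_decode is a hypothetical helper and is not how PyPDF2 itself handles names.

```python
def first_working_decode(raw, encodings=('utf-8', 'gbk')):
    """Try each encoding in order; return (encoding, text) for the first success."""
    for encoding in encodings:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of %r could decode the bytes' % (encodings,))


# The character '日' encoded as GBK starts with byte 0xc8,
# the same byte value named in the traceback.
raw = '日'.encode('gbk')
print(first_working_decode(raw))
```

Here UTF-8 decoding fails on the GBK bytes, so the helper falls back to GBK. Whether the PDFs in question actually contain GBK-encoded names is something you would need to check for your own files.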
Besides the methods listed here, there are others, for example for adding bookmarks and links, which can be found in the official documentation.
Merge, split and encrypt PDF.
The script below covers encrypting, decrypting, merging, splitting by page count, and splitting into a given number of parts.
Note: if the files contain Chinese, garbled characters may appear in the results. Running the code several more times may produce correct Chinese output. I don't know exactly why; it is a bit of a mystery.
Code portal
# @Time : 2018/3/26 23:48
# @Author : Leafage
# @File : handlePDF.py
# @Software: PyCharm
# @describe: Merge, split and encrypt PDF files.
from PyPDF2 import PdfFileReader, PdfFileMerger, PdfFileWriter


def get_reader(filename, password):
    """Open filename and return a PdfFileReader, or None on failure."""
    try:
        old_file = open(filename, 'rb')
    except IOError as err:
        print('File open failed! ' + str(err))
        return None
    # Create a Reader instance
    pdf_reader = PdfFileReader(old_file, strict=False)
    # Decrypt if necessary
    if pdf_reader.isEncrypted:
        if password is None:
            print('%s file is encrypted, password required!' % filename)
            return None
        # decrypt() returns 0 when the password does not match
        if pdf_reader.decrypt(password) == 0:
            print('%s password is incorrect!' % filename)
            return None
    # The file is deliberately left open: PdfFileReader reads page data
    # from the stream lazily, so closing it here would break later reads.
    return pdf_reader


def encrypt_pdf(filename, new_password, old_password=None, encrypted_filename=None):
    """
    Encrypt the file corresponding to filename and generate a new file.
    :param filename: file path
    :param new_password: password used to encrypt the file
    :param old_password: password of the old file, if it is already encrypted
    :param encrypted_filename: name of the encrypted file; defaults to <filename>_encrypted.pdf
    :return:
    """
    # Create a Reader instance
    pdf_reader = get_reader(filename, old_password)
    if pdf_reader is None:
        return
    # Create a Writer instance
    pdf_writer = PdfFileWriter()
    # Copy the pages from the Reader into the Writer
    pdf_writer.appendPagesFromReader(pdf_reader)
    # Re-encrypt with the new password
    pdf_writer.encrypt(new_password)
    if encrypted_filename is None:
        # Use the old file name + '_encrypted' as the new file name
        encrypted_filename = filename.rsplit('.', 1)[0] + '_encrypted.pdf'
    with open(encrypted_filename, 'wb') as out_file:
        pdf_writer.write(out_file)


def decrypt_pdf(filename, password, decrypted_filename=None):
    """
    Decrypt an encrypted file and generate a password-free PDF file.
    :param filename: previously encrypted PDF file
    :param password: corresponding password
    :param decrypted_filename: name of the decrypted file
    :return:
    """
    # Create a Reader and a Writer
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    if not pdf_reader.isEncrypted:
        print('File is not encrypted, no action required!')
        return
    pdf_writer = PdfFileWriter()
    pdf_writer.appendPagesFromReader(pdf_reader)
    if decrypted_filename is None:
        decrypted_filename = filename.rsplit('.', 1)[0] + '_decrypted.pdf'
    # Write the new file
    with open(decrypted_filename, 'wb') as out_file:
        pdf_writer.write(out_file)


def split_by_pages(filename, pages, password=None):
    """
    Split the file into parts with a fixed number of pages each.
    :param filename: name of the file to be split
    :param pages: number of pages in each split file
    :param password: password used to decrypt the file if it is encrypted
    :return:
    """
    # Get a Reader
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    # Get the total number of pages
    pages_nums = pdf_reader.numPages
    if pages <= 1:
        print('Each document must be larger than 1 page!')
        return
    # Number of PDF files after splitting (rounded up)
    pdf_num = pages_nums // pages + 1 if pages_nums % pages else pages_nums // pages
    print('PDF file is divided into %d parts with %d pages each!' % (pdf_num, pages))
    # Generate the PDF files in turn
    for cur_pdf_num in range(1, pdf_num + 1):
        # Create a new Writer instance
        pdf_writer = PdfFileWriter()
        # Generate the corresponding file name
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(cur_pdf_num) + '.pdf'
        # Calculate the current start position
        start = pages * (cur_pdf_num - 1)
        # Calculate the end position: the last page for the final part,
        # otherwise pages per file * number of files produced so far
        end = pages * cur_pdf_num if cur_pdf_num != pdf_num else pages_nums
        # print(str(start) + ',' + str(end))
        # Read the corresponding pages in sequence
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        # Write the file
        with open(split_pdf_name, 'wb') as out_file:
            pdf_writer.write(out_file)


def split_by_num(filename, nums, password=None):
    """
    Divide the PDF file into nums parts.
    :param filename: file name
    :param nums: number of parts to divide the file into
    :param password: password, if the file needs to be decrypted
    :return:
    """
    pdf_reader = get_reader(filename, password)
    if not pdf_reader:
        return
    if nums < 2:
        print('The number of parts must not be less than 2!')
        return
    # Get the total number of pages in the PDF
    pages = pdf_reader.numPages
    if pages < nums:
        print('The number of parts should not be greater than the total number of pages in the PDF!')
        return
    # Calculate how many pages each part should have
    each_pdf = pages // nums
    print('PDF has %d pages, divided into %d parts, each has %d pages!' % (pages, nums, each_pdf))
    for num in range(1, nums + 1):
        pdf_writer = PdfFileWriter()
        # Generate the corresponding file name
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(num) + '.pdf'
        # Calculate the current start position
        start = each_pdf * (num - 1)
        # Calculate the end position: the last page for the final part,
        # otherwise pages per part * number of parts produced so far
        end = each_pdf * num if num != nums else pages
        print(str(start) + ', ' + str(end))
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        with open(split_pdf_name, 'wb') as out_file:
            pdf_writer.write(out_file)


def merger_pdf(filenames, merged_name, passwords=None):
    """
    Pass in a list of files and merge them together.
    :param filenames: list of files
    :param merged_name: name of the merged file
    :param passwords: list of corresponding passwords
    :return:
    """
    # Count how many files there are
    filenums = len(filenames)
    # Note that strict=False is required here
    pdf_merger = PdfFileMerger(strict=False)
    for i in range(filenums):
        # Get the password
        if passwords is None:
            password = None
        else:
            password = passwords[i]
        pdf_reader = get_reader(filenames[i], password)
        if not pdf_reader:
            return
        # append adds to the end by default
        pdf_merger.append(pdf_reader)
    with open(merged_name, 'wb') as out_file:
        pdf_merger.write(out_file)


def insert_pdf(pdf1, pdf2, insert_num, merged_name, password1=None, password2=None):
    """
    Insert pdf2 into pdf1 at the given page position.
    :param pdf1: pdf1 file name
    :param pdf2: pdf2 file name
    :param insert_num: page position at which pdf2 is inserted
    :param merged_name: name of the merged file
    :param password1: pdf1 password
    :param password2: pdf2 password
    :return:
    """
    pdf1_reader = get_reader(pdf1, password1)
    pdf2_reader = get_reader(pdf2, password2)
    # If either file fails to open, return
    if not pdf1_reader or not pdf2_reader:
        return
    # Get the total number of pages in pdf1
    pdf1_pages = pdf1_reader.numPages
    if insert_num < 0 or insert_num > pdf1_pages:
        print('Insert position is abnormal, the position to insert at is: %d, pdf1 file has: %d pages!' % (insert_num, pdf1_pages))
        return
    # Note that strict=False is required here
    m_pdf = PdfFileMerger(strict=False)
    m_pdf.append(pdf1)
    m_pdf.merge(insert_num, pdf2)
    with open(merged_name, 'wb') as out_file:
        m_pdf.write(out_file)


if __name__ == '__main__':
    # encrypt_pdf('ex1.pdf', 'leafage')
    # decrypt_pdf('ex1123_encrypted.pdf', 'leafage')
    # split_by_pages('ex1.pdf', 5)
    split_by_num('ex2.pdf', 3)
    # merger_pdf(['ex1.pdf', 'ex2.pdf'], 'merger.pdf')
    # insert_pdf('ex1.pdf', 'ex2.pdf', 10, 'pdf12.pdf')
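The start/end arithmetic in split_by_pages and split_by_num can be checked on its own, without any PDF file. The following is a minimal sketch with two hypothetical helpers (they are not part of the script above or of PyPDF2) that return zero-based (start, end) pairs with end exclusive. Note that ranges_by_count is a variation, not the script's method: it spreads leftover pages over the first parts, whereas split_by_num gives all leftover pages to the last part.

```python
def ranges_by_size(total_pages, pages_per_file):
    """Return (start, end) pairs covering total_pages in chunks of pages_per_file."""
    if pages_per_file < 1:
        raise ValueError('pages_per_file must be at least 1')
    return [
        (start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


def ranges_by_count(total_pages, parts):
    """Return `parts` (start, end) pairs whose sizes differ by at most one page."""
    if not 1 <= parts <= total_pages:
        raise ValueError('parts must be between 1 and total_pages')
    base, extra = divmod(total_pages, parts)
    ranges = []
    start = 0
    for index in range(parts):
        # The first `extra` parts each take one of the leftover pages.
        end = start + base + (1 if index < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


print(ranges_by_size(12, 5))   # [(0, 5), (5, 10), (10, 12)]
print(ranges_by_count(11, 3))  # [(0, 4), (4, 8), (8, 11)]
```

Each pair could then drive a loop such as for i in range(start, end), in the same way the script feeds pages to addPage.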