Preface
Converting PDF documents to Word is a common task at work, and an effective solution is often hard to find. Plenty of tools and programs can do the conversion, but most of them charge a fee, and they do not guarantee that the formatting or the content will survive intact. For those who would rather not pay, this post shows how to convert a PDF document to Word with 64 lines of Python code.
Python has many powerful libraries that can help with everyday work problems. The two used here to convert PDF documents to Word documents are pdfminer and python-docx.
PDFMiner is a library for extracting information from PDF documents. It can identify text, images, figures and other layout objects in a PDF, and it parses and extracts them quickly.
Note: pdfminer3k, used with Python 3, and pdfminer, used with Python 2, are not compatible with each other.
python-docx is a Python library for reading and writing Word documents. In the PDF-to-Word process, it writes the content extracted by pdfminer into a Word document.
Installation
pip install pdfminer3k
For Python 2, install pdfminer instead: pip install pdfminer
pip install python-docx
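Before running the converter, it can help to confirm that both packages are importable. The following is a minimal sketch and not part of the original script: the helper name `missing_packages` is hypothetical, and it assumes the import names are `pdfminer` and `docx`, which are the names the import statements in this post use.

```python
import importlib.util


def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "pdfminer" and "docx" are the import names used by the script in this post.
    missing = missing_packages(["pdfminer", "docx"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

This check only reports whether a package can be found. It does not tell pdfminer3k apart from the Python 2 pdfminer package, since both install under the same import name.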
Start Coding
Import the necessary development packages
import sys
import importlib
from docx import Document
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed, PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfparser import PDFParser, PDFDocument
importlib.reload(sys)
This is only a demonstration. In practice, you can let your IDE, such as PyCharm, prompt you to import the packages you need.
Create a PDF2Word class that initializes the pdfminer objects.
class PDF2Word:
    def __init__(self, pdf_path):
        # Open the PDF in binary read mode
        fp = open(pdf_path, 'rb')
        # Create a PDF parser from the file object
        parser = PDFParser(fp)
        # Create a PDF document object
        self.doc = PDFDocument()
        # Connect the parser and the document object to each other
        parser.set_document(self.doc)
        self.doc.set_parser(parser)
        # Provide the initial password;
        # with no argument, an empty password is used
        self.doc.initialize()
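The constructor hands whatever file it is given straight to the parser. As an optional pre-check, you could look for the PDF header before parsing. This is a minimal sketch under stated assumptions: the helper names `has_pdf_header` and `looks_like_pdf` are hypothetical, and the check relies only on the fact that PDF files carry a `%PDF-` marker near the start of the file.

```python
def has_pdf_header(head):
    """Return True if the given leading bytes contain the PDF header marker."""
    # The marker normally sits at offset 0; a small window tolerates leading junk.
    return b"%PDF-" in head[:1024]


def looks_like_pdf(path):
    """Read the first bytes of the file at `path` and check for a PDF header."""
    with open(path, "rb") as fp:
        return has_pdf_header(fp.read(1024))
```

A caller could test `looks_like_pdf(pdf_path)` before constructing PDF2Word and report a clear error for files that are not PDFs. A file that passes this check can still be damaged or encrypted, so it does not replace the parser's own error handling.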
PDF to Word implementation method
    # PDF to Word
    def pdf_to_word(self, save_path):
        # Check whether the document allows text extraction; if not, raise an error
        if not self.doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        else:
            # Create a PDF resource manager to manage shared resources
            rsrcmgr = PDFResourceManager()
            # Create a PDF device object
            laparams = LAParams()
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            # Create a PDF interpreter object
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # Counters for pages, images, curves, figures and horizontal text boxes
            num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0
            # Create the Word document object first
            document = Document()
            # Loop over the pages, processing one page at a time
            for page in self.doc.get_pages():  # doc.get_pages() yields the pages
                num_page += 1  # increment the page count
                interpreter.process_page(page)
                # Receive the LTPage object for this page
                layout = device.get_result()
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):  # keep horizontal text boxes only
                        results = x.get_text()
                        document.add_paragraph(results)
            # Save the Word document once all pages have been processed
            document.save(save_path)
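The text returned by `get_text()` for a text box usually keeps the line breaks of the PDF layout, and it can contain control characters that a Word document cannot store. One way to handle this is to clean each text box before passing it to `add_paragraph`. This is a minimal sketch and not part of the original script: the helper name `clean_text` is hypothetical, and joining lines with a single space is an assumption that suits space-separated languages better than Chinese or hyphenated text.

```python
import re

# Control characters that XML 1.0 does not allow.
# Tab, newline and carriage return are allowed, so they are not listed here.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_text(raw):
    """Normalize the text of one PDF text box before writing it to Word."""
    text = _INVALID_XML_CHARS.sub("", raw)
    # Join the visual lines of the box into a single paragraph,
    # dropping lines that are empty after stripping whitespace.
    lines = [line.strip() for line in text.splitlines()]
    return " ".join(line for line in lines if line)
```

Inside the loop, the result of `clean_text(x.get_text())` could be written with `document.add_paragraph` only when it is not empty, which also avoids blank paragraphs in the output.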
Call the PDF to Word method
if __name__ == '__main__':
    pdf_path = 'Alibaba Java Development Manual 1.4.0.pdf'
    convert_file = PDF2Word(pdf_path)
    convert_file.pdf_to_word('Alibaba Java Development Manual 1.4.0.docx')
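The call above types the file name twice, once for the PDF and once for the Word document. If the output should simply sit next to the input, the target path can be derived from the source path. This is a minimal sketch using the standard library's pathlib; the helper name `default_docx_path` is hypothetical and not part of the original script.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return `pdf_path` with its final extension replaced by .docx."""
    # with_suffix swaps only the last suffix, so "Manual 1.4.0.pdf"
    # becomes "Manual 1.4.0.docx" and the version number is kept.
    return str(Path(pdf_path).with_suffix(".docx"))
```

With this helper, the conversion could be started as `PDF2Word(pdf_path).pdf_to_word(default_docx_path(pdf_path))`.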
At this point the conversion from PDF to Word is complete. The example has been uploaded to GitHub:
Github.com/Jboob/pdf2w…