Preface

Converting PDF documents to Word documents is a common task at work, and an effective solution is often hard to find. Plenty of tools and applications offer conversion, but most of them charge a fee and cannot guarantee that the formatting and content survive intact. For those of you who don't want to pay, this article shows how to convert a PDF document to Word with 64 lines of Python code.

Python has many powerful libraries that can help solve the problems we face in our work. PDFMiner and python-docx are the two libraries used here to convert PDF documents to Word documents.

PDFMiner is a library for extracting information from PDF documents. It can locate text, images, figures and other layout objects in a PDF document, and it parses and extracts them quickly.

Note: pdfminer3k (for Python 3) and pdfminer (for Python 2) are separate packages and are not compatible with each other.

python-docx is a Python library for reading and writing Word documents. In the PDF-to-Word process, it writes the content extracted by PDFMiner into a Word document.

Installation

pip install pdfminer3k

On Python 2, install pdfminer instead.

pip install python-docx
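
Before moving on, it can help to confirm that both libraries are importable. The following is a minimal sketch and not part of the original script; the helper name `missing_modules` is hypothetical. It uses only the standard-library `importlib.util.find_spec`, and it assumes the import names `pdfminer` and `docx`, which are the modules the script imports later.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pdfminer3k is imported as "pdfminer"; python-docx is imported as "docx".
    missing = missing_modules(["pdfminer", "docx"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("Both libraries are available.")
```

If a name is reported as missing, rerun the matching pip command for the Python interpreter you are actually using.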

Start Coding

Import the required packages

# python-docx: writes the Word document
from docx import Document
# pdfminer3k: parses the PDF and analyses the page layout
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed, PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfparser import PDFParser, PDFDocument

The imports are listed here for completeness. In practice, an IDE such as PyCharm will suggest the required imports as you write the code.

Create a PDF2Word class whose constructor sets up the PDFMiner parser and document objects.

class PDF2Word:
    def __init__(self, pdf_path):
        # Open the PDF in binary read mode
        fp = open(pdf_path, 'rb')
        # Create a PDF parser from the file object
        parser = PDFParser(fp)
        # Create a PDF document object
        self.doc = PDFDocument()
        # Connect the parser and the document object to each other
        parser.set_document(self.doc)
        self.doc.set_parser(parser)

        # Supply the password used to open the document;
        # with no argument, an empty password is used
        self.doc.initialize()
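
The constructor opens the file in binary mode and never closes it. As a sketch of one way to tidy this up, the hypothetical helper `open_pdf` below wraps the file in a context manager that closes it afterwards and first checks for the `%PDF-` marker that PDF files start with. It assumes the marker sits at the very first byte, which is the common case but not guaranteed for every file, and it is not part of the original script.

```python
from contextlib import contextmanager


@contextmanager
def open_pdf(pdf_path):
    """Yield a binary file object for `pdf_path` and close it afterwards.

    Raises ValueError if the file does not start with the "%PDF-" marker.
    """
    fp = open(pdf_path, "rb")
    try:
        header = fp.read(5)
        if header != b"%PDF-":
            raise ValueError("not a PDF file: %r" % (pdf_path,))
        fp.seek(0)  # rewind so a parser sees the file from the start
        yield fp
    finally:
        fp.close()
```

Inside a `with open_pdf(path) as fp:` block, `fp` could be handed to `PDFParser(fp)` in the same way as in the constructor. The parsing and the conversion would then have to finish before the block exits, because the file is closed at that point.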

The PDF-to-Word conversion method

    # PDF to Word
    def pdf_to_word(self, sve_path):
        # Raise an error if the document does not allow text extraction
        if not self.doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        else:
            # Create a PDF resource manager to manage shared resources
            rsrcmgr = PDFResourceManager()
            # Create a PDF device object
            laparams = LAParams()
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            # Create a PDF interpreter object
            interpreter = PDFPageInterpreter(rsrcmgr, device)

            # Counters for pages, images, curves, figures, horizontal text boxes, etc.
            num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0

            # Create the Word document object first
            document = Document()
            # Loop over the pages one at a time
            for page in self.doc.get_pages():  # doc.get_pages() yields the pages
                num_page += 1  # increment the page count
                interpreter.process_page(page)
                # Get the LTPage layout object for this page
                layout = device.get_result()
                for x in layout:
                    # Copy the text of each horizontal text box into a paragraph
                    if isinstance(x, LTTextBoxHorizontal):
                        results = x.get_text()
                        document.add_paragraph(results)
            document.save(sve_path)
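
The text returned by `get_text()` keeps the line breaks of the PDF layout and ends with a newline, so each Word paragraph inherits the hard wrapping of the original page. Extracted text can also occasionally contain control characters, which XML-based formats such as .docx cannot store. The sketch below shows one possible clean-up step; the helper name `clean_text_box` and its `joiner` parameter are hypothetical and not part of the original script.

```python
import re

# Control characters that XML (and therefore a .docx file) cannot contain;
# tab, newline and carriage return are allowed and left alone.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_text_box(text, joiner=" "):
    """Turn the raw text of one text box into a single paragraph string."""
    text = _CONTROL_CHARS.sub("", text)
    # Strip each hard-wrapped line and drop the empty ones.
    lines = (line.strip() for line in text.splitlines())
    return joiner.join(line for line in lines if line)


if __name__ == "__main__":
    raw = "First line of a box\nsecond line of the box\n\x0c"
    print(clean_text_box(raw))
```

In `pdf_to_word`, the paragraph could then be added with `document.add_paragraph(clean_text_box(x.get_text()))`, skipping boxes whose cleaned text is empty. Joining with a space suits languages that separate words with spaces; for Chinese text, passing `joiner=""` is probably the better choice.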


Call the PDF to Word method

if __name__ == '__main__':
    pdf_path = 'Alibaba Java Development Manual 1.4.0.pdf'
    # Build the converter for the source PDF, then write the Word file
    convert_file = PDF2Word(pdf_path)
    convert_file.pdf_to_word('Alibaba Java Development Manual 1.4.0.docx')
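
The output file name above is typed out by hand next to the input name. As a small sketch, the hypothetical helper `docx_path_for` derives it from the PDF path with the standard-library `pathlib`, so the two names cannot drift apart. It is not part of the original script.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return `pdf_path` with its final suffix replaced by .docx."""
    return str(Path(pdf_path).with_suffix(".docx"))


if __name__ == "__main__":
    print(docx_path_for("Alibaba Java Development Manual 1.4.0.pdf"))
```

Only the last suffix is replaced, so the version number in the file name is kept. The result could be passed straight to `pdf_to_word`.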

At this point, the conversion from PDF to Word is complete. The example has been uploaded to GitHub:

Github.com/Jboob/pdf2w…