This is the 18th day of my participation in the August Genwen Challenge.

Environment

  • Windows 10, 64-bit
  • Anaconda with Python 3.7
  • pdf2docx 0.5.2

Preface

Converting PDF files to Word files is a very common task. Most people's free solution is probably an online conversion service, but uploading documents to a third party risks leaking data. This article introduces pdf2docx, a free, open-source tool that performs the conversion locally.

Install pdf2docx

Installation is simple. Run the following pip command:

pip install pdf2docx

After a successful installation, pdf2docx provides not only the Python library but also an executable named pdf2docx.

For everyday use, you can convert PDF to DOCX directly with the executable; if you need the conversion inside Python code, use the API it provides.
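To confirm that both pieces mentioned above are in place, a minimal check along the following lines may help. The function name pdf2docx_status is made up for this sketch; it relies only on the standard library and assumes the executable is reachable through PATH.

```python
import importlib.util
import shutil


def pdf2docx_status() -> dict:
    """Report whether the pdf2docx library and executable can be found."""
    return {
        # True if `import pdf2docx` would find the package
        "library": importlib.util.find_spec("pdf2docx") is not None,
        # Full path of the executable on PATH, or None if it is not found
        "executable": shutil.which("pdf2docx"),
    }


print(pdf2docx_status())
```

If "executable" comes back as None while "library" is True, the Scripts directory of the Python environment is probably missing from PATH.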

Command-line usage

Run pdf2docx --help on the command line to view the help information

INFO: Showing help with the command 'pdf2docx -- --help'.

NAME
    pdf2docx - Command line interface for ``pdf2docx``.

SYNOPSIS
    pdf2docx COMMAND | -

DESCRIPTION
    Command line interface for ``pdf2docx``.

COMMANDS
    COMMAND is one of the following:

     convert
       Convert pdf file to docx file.

     debug
       Convert one PDF page and plot layout information for debugging.

     gui
       Simple user interface.

     table
       Extract table content from pdf pages.

The help output above lists the commands supported by pdf2docx. Here we focus on convert and gui.

  • convert

    convert is the core command. It accepts a number of parameters, which you can view with pdf2docx convert --help. The same approach works for the other commands, so they are not covered in detail below.

    (base) PS C:\Users\Administrator> pdf2docx.exe convert --help
    INFO: Showing help with the command 'pdf2docx convert -- --help'.
    
    NAME
        pdf2docx convert - Convert pdf file to docx file.
    
    SYNOPSIS
        pdf2docx convert PDF_FILE <flags>
    
    DESCRIPTION
        Convert pdf file to docx file.
    
    POSITIONAL ARGUMENTS
        PDF_FILE
            Type: str
            PDF filename to read from.
    
    FLAGS
        --docx_file=DOCX_FILE
            Type: Optional[str]
            Default: None
            docx filename to write to. Defaults to None.
        --password=PASSWORD
            Type: Optional[str]
            Default: None
            Password for encrypted pdf. Default to None if not encrypted.
        --start=START
            Type: int
            Default: 0
            First page to process. Defaults to 0.
        --end=END
            Type: Optional[int]
            Default: None
            Last page to process. Defaults to None.
        --pages=PAGES
            Type: Optional[list]
            Default: None
            Range of pages. Defaults to None.
        Additional flags are accepted.
            Configuration parameters.
    
            .. note
    
    NOTES
        You can also use flags syntax for POSITIONAL ARGUMENTS

    To convert all pages in a PDF, run

    pdf2docx.exe convert test.pdf test.docx

    To convert from page 3 to the end

    pdf2docx.exe convert test.pdf test.docx --start=2

    To convert from the beginning through page 10

    pdf2docx.exe convert test.pdf test.docx --end=10

    To convert pages 2 through 5

    pdf2docx.exe convert test.pdf test.docx --start=1 --end=5

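    Note that both start and end are zero-based page indexes. As the examples above show, end is exclusive: --end=10 stops after the tenth page (index 9).

    Discrete pages can also be converted in one go, for example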

    pdf2docx.exe convert test.pdf test.docx --pages=0,2,4

    If the PDF is encrypted, pass its password with --password

    pdf2docx.exe convert test.pdf test.docx --password=PASSWORD
  • gui

    If you are not comfortable with the command line, pdf2docx also provides a simple graphical interface, which you can open by running pdf2docx gui from cmd. It is quite rough, and the button text is not fully displayed, but it works.
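When the convert command has to be driven from a script, the flag combinations shown for convert can be assembled programmatically. The following is only a sketch: build_convert_command and run_convert are hypothetical helper names, and the flags used are the ones listed in the help output (--start, --end, --pages, --password). It assumes the pdf2docx executable is on PATH.

```python
import subprocess
from typing import List, Optional, Sequence


def build_convert_command(
    pdf_file: str,
    docx_file: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
    pages: Optional[Sequence[int]] = None,
    password: Optional[str] = None,
    executable: str = "pdf2docx",
) -> List[str]:
    """Build the argument list for `pdf2docx convert` (page indexes are zero-based)."""
    command = [executable, "convert", pdf_file, docx_file]
    if start is not None:
        command.append(f"--start={start}")
    if end is not None:
        command.append(f"--end={end}")
    if pages is not None:
        # Discrete pages are passed as a comma-separated list, e.g. --pages=0,2,4
        command.append("--pages=" + ",".join(str(p) for p in pages))
    if password is not None:
        command.append(f"--password={password}")
    return command


def run_convert(command: List[str]) -> int:
    """Run the command and return its exit code (needs pdf2docx on PATH)."""
    return subprocess.run(command).returncode


# Pages 2 to 5, matching the example in the text
print(build_convert_command("test.pdf", "test.docx", start=1, end=5))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.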

Using the API

If you want to convert PDF to DOCX from Python code, pdf2docx provides a full API. Let's look at the simplest example:

from pdf2docx import Converter


if __name__ == "__main__":
    pdf_file = "test.pdf"
    docx_file = "test.docx"

    # Open the PDF, convert every page (start=0, end=None), then release the file
    conv = Converter(pdf_file)
    conv.convert(docx_file, start=0, end=None)
    conv.close()

For more detailed API documentation, see dothinking.github.io/pdf2docx/mo…
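Because start and end are zero-based, it is easy to be off by one when working from the page numbers shown in a PDF viewer. The sketch below wraps the Converter call from the simple example in two hypothetical helpers, to_zero_based_range and convert_page_range, that accept ordinary 1-based page numbers. It assumes, in line with the command-line examples in this article, that end is exclusive, so pages 2 to 5 become start=1, end=5.

```python
from typing import Optional, Tuple


def to_zero_based_range(
    first_page: int, last_page: Optional[int] = None
) -> Tuple[int, Optional[int]]:
    """Map 1-based, inclusive page numbers to pdf2docx's (start, end) values."""
    if first_page < 1:
        raise ValueError("first_page must be 1 or greater")
    if last_page is not None and last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    # start is zero-based; end is assumed exclusive, so last_page is used unchanged
    return first_page - 1, last_page


def convert_page_range(
    pdf_file: str,
    docx_file: str,
    first_page: int = 1,
    last_page: Optional[int] = None,
) -> None:
    """Convert pages first_page..last_page (1-based, inclusive) to a DOCX file."""
    # Imported here so the helper above stays usable without pdf2docx installed
    from pdf2docx import Converter

    start, end = to_zero_based_range(first_page, last_page)
    conv = Converter(pdf_file)
    try:
        conv.convert(docx_file, start=start, end=end)
    finally:
        # Release the file even if the conversion raises
        conv.close()
```

For example, convert_page_range("test.pdf", "test.docx", 2, 5) would request the same pages as --start=1 --end=5 on the command line.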

Limitations

The current version of pdf2docx only works with text-based PDFs and assumes a left-to-right reading order. Keep this in mind when using it.
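One rough way to tell whether a PDF is text-based before converting it is to check whether any text can be extracted from it. The sketch below is an assumption-laden illustration, not part of pdf2docx: looks_text_based and extract_page_texts are hypothetical names, the 20-character threshold is arbitrary, and the extraction step relies on PyMuPDF (imported as fitz), which pdf2docx depends on.

```python
from typing import Iterable, List


def looks_text_based(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Return True if the pages hold at least min_chars non-whitespace characters."""
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def extract_page_texts(pdf_file: str) -> List[str]:
    """Extract the plain text of every page using PyMuPDF."""
    import fitz  # PyMuPDF; assumed to be present as a pdf2docx dependency

    doc = fitz.open(pdf_file)
    try:
        texts = []
        for page in doc:
            # Newer PyMuPDF releases call this get_text(); older ones use getText()
            get_text = getattr(page, "get_text", None) or page.getText
            texts.append(get_text())
        return texts
    finally:
        doc.close()
```

A scanned document typically yields little or no extractable text, so looks_text_based(extract_page_texts("test.pdf")) returning False suggests the file needs OCR before pdf2docx can do anything useful with it.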

More Python utility modules

For more useful Python modules, see

xugaoxiang.com/category/py…

References

  • github.com/dothinking/…