Author: Jiang Xia

| Zhihu: www.zhihu.com/people/1024…

| GitHub: github.com/JiangXia-10…

| CSDN: blog.csdn.net/qq_4115394…

| Juejin: juejin.cn/user/651387…

| Official account: 1024 notes

This article contains 1469 words and takes 9 minutes to read

PDF is a convenient format because a document looks the same no matter which editor or device opens it. Sometimes, however, we need to modify a document, and then the PDF has to be converted to Word format. There are many conversion websites and programs online, but most of them allow only a few free conversions before asking you to upgrade to a paid VIP plan. So wouldn't it be convenient, and rather impressive, to write a PDF conversion program ourselves?

This article shows how to write a PDF-to-Word conversion tool in Python.

I am using Windows 10 and Python 3.7.

The only dependency is pdfminer3k, which can be installed with the following command:

pip install pdfminer3k
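Before running the script, it can help to confirm that the package is importable. The following is a minimal sketch, assuming pdfminer3k installs its modules under the top-level name `pdfminer` (as the imports used in this article suggest); the helper name `is_installed` is hypothetical and not part of any library:

```python
import importlib.util


def is_installed(package_name):
    """Return True if the import system can locate a top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == '__main__':
    # 'pdfminer' is the import name assumed from the script in this article.
    if is_installed('pdfminer'):
        print('pdfminer is available')
    else:
        print('pdfminer not found; run: pip install pdfminer3k')
```

This only checks that the package can be found; it does not verify which version is installed.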

The full code is as follows. The purpose of each step is explained in the comments, so it is not repeated here:

# author: Jiang Xia
# date: 2020-10-31
# description: extract the text of a PDF and save it to a .doc file
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTImage, LTCurve, LTFigure, LTTextBoxHorizontal


def parse(pdf_path):
    """Parse a PDF file and append the text of each horizontal text box to test.doc."""
    # Open the PDF in binary read mode
    with open(pdf_path, 'rb') as fp:
        # Create a PDF parser from the file object
        parser = PDFParser(fp)
        # Create a PDF document
        doc = PDFDocument()
        # Connect the parser and the document object to each other
        parser.set_document(doc)
        doc.set_parser(parser)
        # Initialize the document (no password is supplied)
        doc.initialize()

        # Stop if the document does not allow text extraction
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed

        # Create a resource manager to hold shared resources
        rsrcmgr = PDFResourceManager()
        # Create a layout-analysis device
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Counters for pages, images, curves, figures and horizontal text boxes
        num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0

        # doc.get_pages() yields the pages; process them one at a time
        for page in doc.get_pages():
            num_page += 1
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTImage):
                    num_image += 1
                if isinstance(x, LTCurve):
                    num_curve += 1
                if isinstance(x, LTFigure):
                    num_figure += 1
                if isinstance(x, LTTextBoxHorizontal):
                    num_TextBoxHorizontal += 1
                    # Append the text of this box to the output file
                    with open(r'test.doc', 'a', encoding='utf-8') as f:
                        results = x.get_text()
                        f.write(results + '\n')

        print('Object counts:\n',
              'pages: %s\n' % num_page,
              'images: %s\n' % num_image,
              'curves: %s\n' % num_curve,
              'horizontal text boxes: %s\n' % num_TextBoxHorizontal)


# Run the main function
if __name__ == '__main__':
    pdf_path = r'C:\Users\Jiang\Desktop\test.pdf'
    parse(pdf_path)
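The script opens test.doc in append mode for every text box, so running it twice on the same PDF adds the same text to the file again. One possible alternative is to collect the text first and write it in a single pass. The following is a minimal sketch, assuming the text of each box has already been gathered into a list of strings; `save_text_blocks` is a hypothetical helper and not part of pdfminer3k:

```python
import os
import tempfile


def save_text_blocks(blocks, out_path):
    """Write each text block on its own line, replacing any previous output."""
    # Mode 'w' truncates the file, so re-running does not duplicate content.
    with open(out_path, 'w', encoding='utf-8') as f:
        for block in blocks:
            f.write(block + '\n')
    return len(blocks)


if __name__ == '__main__':
    # Stand-in strings; in the script these would come from x.get_text().
    sample = ['First paragraph', 'Second paragraph']
    out_path = os.path.join(tempfile.mkdtemp(), 'test.doc')
    count = save_text_blocks(sample, out_path)
    print('text boxes written:', count, 'to', out_path)
```

Opening the file once also avoids the cost of reopening it for every text box on every page.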

The content of the PDF document is as follows:

Running the code:

The generated doc file:

Opening it shows the following content:

That is an example of how Python can use pdfminer3k to convert a PDF to a TXT or doc file.

In addition, the source code for all of the hands-on articles is synced to GitHub. Feel free to download and use it if you need it.
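Because the script in this article writes plain extracted text, switching between a TXT and a doc output comes down to the name of the output file. The following is a small sketch using only the standard library; `output_path_for` is a hypothetical helper and the sample file name is illustrative:

```python
from pathlib import PurePath


def output_path_for(pdf_path, extension='.doc'):
    """Return pdf_path with its suffix swapped for the given extension."""
    return str(PurePath(pdf_path).with_suffix(extension))


if __name__ == '__main__':
    print(output_path_for('report.pdf'))          # report.doc
    print(output_path_for('report.pdf', '.txt'))  # report.txt
```

Deriving the name from the input path avoids hard-coding test.doc when several PDFs are converted.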

Finally, if you found this article useful, please like it and recommend it to others. You are also welcome to follow the official account 1024 notes for free access to a large collection of learning resources (videos, source code, and documents).

Other recommendations:

  • Getting started with Python (3): Using tuples

  • Getting started with Python (4): Using sets

  • Getting started with Python (5): Using dicts

  • Getting started with Python (6): Calling custom functions

  • Getting started with Python (1): String formatting

  • Crawling Taobao product information with Python and generating an Excel file