Big data abstracts

Author | Ding Yanjun

In daily work or study, we often run into this kind of helplessness:

“Xiao Ren, please type up the code in this PDF and send it to me.”

Damn it, a 2 MB PDF! There is no way I can finish that by 12 o'clock!





When studying, you will find that many documents come in PDF format, which is inconvenient to work with, so you often need to convert PDF files to Word files. You may have downloaded plenty of software from the Internet, only to find that it converts just the first five pages (WPS, for example) or that it charges a fee. Is there any free way to convert?

So here is a free, simple, and fast approach: batch process PDF files in Python, extract the content you want, and save it as Word.

Before implementing the PDF-to-Word function, we need an environment for writing and running Python, with the relevant dependency packages installed. PyCharm is recommended as the editor, and Anaconda makes it easy to install and deploy Python on a local computer.

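The PDF-to-Word function relies on the following classes from the pdfminer3k package:

  • PDFParser (document parser)
  • PDFDocument (document object)
  • PDFResourceManager (resource manager)
  • PDFPageInterpreter (page interpreter)
  • PDFPageAggregator (aggregator that reads the page content)
  • LAParams (layout parameter analyzer)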

Preparatory work

Note: this article uses Python 3.6 on Windows 7.

1. Install the pdfminer3k module

After installing Anaconda, you can install it directly with pip:

    pip install pdfminer3k


2. If the installation fails, try the following method

Download pdfminer3k from pypi.python.org/pypi/pdfmin… and unpack it to D: or another suitable drive. Open the Run window with Win + R and enter cmd. In the command prompt, type D: to switch to drive D, then cd pdfminer3k to enter the folder you unpacked, and finally python setup.py install to install the package.


If "Finished" is displayed, the installation succeeded.

The code

1. Import related packages

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

The overall idea is: construct the document object, parse the document object, and extract the required content.



2. Import the PDF file to be parsed

Place the PDF to be parsed (test.pdf in this example) in the same directory as the script, as shown in the figure below:



The test PDF content
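The script in this article opens a single file, test.pdf. As a rough sketch of how the batch step might look, the snippet below lists every PDF in a folder and pairs it with a .txt output name. The function name plan_batch and the choice of a .txt suffix are assumptions made for illustration and are not part of pdfminer3k; only the Python standard library is used.

```python
from pathlib import Path


def plan_batch(directory):
    """Pair every PDF in `directory` with a .txt output path of the same name.

    Hypothetical helper for illustration; it only builds paths and does not
    parse anything itself.
    """
    folder = Path(directory)
    # Compare the extension case-insensitively so "B.PDF" is included too.
    pdfs = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(pdf, pdf.with_suffix(".txt")) for pdf in pdfs]


if __name__ == "__main__":
    # List the planned conversions for the current directory.
    for pdf_path, txt_path in plan_batch("."):
        print(pdf_path.name, "->", txt_path.name)
```

Each pair could then be handed to a parsing function that accepts the input and output paths as parameters instead of hard-coding the file names.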

3. The specific code is as follows:

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed


def parse():
    fn = open('test.pdf', 'rb')
    # Create a PDF document parser
    parser = PDFParser(fn)
    # Create a PDF document object
    doc = PDFDocument()
    # Connect the parser and the document object to each other
    parser.set_document(doc)
    doc.set_parser(parser)
    # Provide the initialization password, for example:
    # doc.initialize("lianxipython")
    # If there is no password, pass an empty string
    doc.initialize("")
    # Check whether the document allows text extraction; stop if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        resource = PDFResourceManager()
        # Create a layout parameter analyzer
        laparams = LAParams()
        # Create an aggregator, the object used to read page content
        device = PDFPageAggregator(resource, laparams=laparams)
        # Create an interpreter that processes each page
        interpreter = PDFPageInterpreter(resource, device)
        # Loop over the pages; doc.get_pages() yields the pages of the document
        for page in doc.get_pages():
            # Let the interpreter read the page
            interpreter.process_page(page)
            # Use the aggregator's get_result() method to obtain the layout
            layout = device.get_result()
            # Only objects with a get_text() method carry text
            for out in layout:
                if hasattr(out, "get_text"):
                    print(out.get_text())
                    with open('test.txt', 'a') as f:
                        f.write(out.get_text() + '\n')


if __name__ == '__main__':
    parse()
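The script above keeps only layout objects that have a get_text() method and reopens test.txt in append mode for every text block. The following is a minimal sketch of that filtering step on its own, under the assumption that the text is collected first and written once. FakeTextBox and FakeFigure are hypothetical stand-ins for pdfminer layout objects, and collect_text and save_text are illustrative names, not pdfminer3k functions.

```python
def collect_text(layout_objects):
    """Return the text of every object that exposes a get_text() method."""
    return [obj.get_text() for obj in layout_objects if hasattr(obj, "get_text")]


def save_text(blocks, out_path):
    """Write all blocks in one pass, opening the file once as UTF-8."""
    with open(out_path, "w", encoding="utf-8") as f:
        for block in blocks:
            f.write(block + "\n")


# Hypothetical stand-ins for pdfminer layout objects (illustration only).
class FakeTextBox:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure:
    """Has no get_text(), like an image or a drawn line on a page."""


if __name__ == "__main__":
    page = [FakeTextBox("Hello"), FakeFigure(), FakeTextBox("World")]
    # Only the two text boxes survive the hasattr() filter.
    print(collect_text(page))
```

Writing with mode "w" overwrites the output on each run, whereas the append mode used in the script adds to test.txt every time it is executed; which behavior is preferable depends on whether several PDFs should end up in one file.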

The result of test.txt is as follows:


Conclusion

That concludes this introduction to batch PDF-to-Word conversion in Python. This article only shows the process of writing the code, using one library as an application example. The specific techniques call for further study, and interested readers are welcome to discuss them with me so that we can learn and improve together.

The opinions expressed in this article are personal.

About the author:

An amateur programmer obsessed with the Python language who, after half a year of hard practice and a journey from getting started to giving up, is now fortunately hooked on Python again. The author's ideal for the future is to do meaningful things together with a group of Python-crazy programmers. Zhihu column link: www.zhihu.com/people/cai-…