In my work, I often need to extract text from PDF files. For a single PDF that is fine, since copying and pasting doesn't take long. But what if I need to convert a large number of PDF files to Word?


Today we'll show how to batch-convert PDF files to Word in parallel with about 60 lines of Python. If you don't care how it works, you can skip to the end, where the code is linked.


Task decomposition

How to convert PDF to Word? The first step is to read the PDF file, and the second step is to write the Word file.


Yes, it's that simple. Both steps can be implemented easily with two third-party Python packages, pdfminer3k and python-docx.


Read the PDF

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf

resource_manager = PDFResourceManager()
return_str = StringIO()
lap_params = LAParams()
device = TextConverter(resource_manager, return_str, laparams=lap_params)
# file is a PDF file handle opened in binary mode with open(path, 'rb')
process_pdf(resource_manager, device, file)
device.close()
# content now holds the extracted text
content = return_str.getvalue()

The content variable stores the text we read from the PDF file. As you can see, pdfminer3k makes this easy. Next, we need to write the text into a Word file.
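The converter only needs a file-like object to write into, which is why an in-memory `StringIO` buffer is enough to capture the whole document. The sketch below illustrates that pattern with the standard library alone. `fake_converter` and `collect_text` are hypothetical names made up for this illustration; they stand in for the PDF converter and are not part of pdfminer3k.

```python
from io import StringIO


def fake_converter(pages, out):
    """Hypothetical stand-in for a PDF text converter: writes each page's text to `out`."""
    for page in pages:
        out.write(page)
        out.write("\n")


def collect_text(pages):
    # StringIO behaves like a text file opened for writing, but keeps everything in memory.
    buffer = StringIO()
    fake_converter(pages, buffer)
    # getvalue() returns everything written so far as a single str.
    return buffer.getvalue()


text = collect_text(["first page", "second page"])
print(repr(text))
```

Because the buffer lives in memory, nothing is written to disk until the Word file is saved.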


Write the Word file

from docx import Document


doc = Document()
for line in content.split('\n'):
    paragraph = doc.add_paragraph()
    paragraph.add_run(remove_control_characters(line))
doc.save(file_path)

content is the text we read earlier. Because the whole PDF is read as a single string, we use the split method to separate it into lines and write each line to the Word file as its own paragraph; otherwise all the text would end up in a single paragraph. This code also uses a remove_control_characters function, which we need to implement ourselves. It strips control characters (tabs, escapes, and so on), because python-docx does not support writing them.

def remove_control_characters(content):
    # Map code points 0-31 to None so that translate() deletes them
    mapping = dict.fromkeys(range(32))
    return content.translate(mapping)

The control characters in question are those with ASCII codes below 32, so we use the translate method of str to delete every character in that range.
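To see what this does to real input, here is a small self-contained sketch that splits a string into lines and cleans each one. The sample string is made up, and `clean_lines` is a hypothetical helper name introduced only for this illustration. Note that DEL (code 127) is not below 32, so this table leaves it in place.

```python
def remove_control_characters(content):
    # Map code points 0..31 to None so that str.translate drops them.
    table = dict.fromkeys(range(32))
    return content.translate(table)


def clean_lines(content):
    """Split extracted text into lines and strip control characters from each line."""
    return [remove_control_characters(line) for line in content.split("\n")]


# Made-up sample: a tab, a form feed, a NUL byte, a carriage return and a DEL character.
sample = "Title\tpage 1\x0c\nBody\x00 text\r\nEnd\x7f"
lines = clean_lines(sample)
print(lines)
```

The tab between "Title" and "page" is deleted rather than replaced with a space, so the two words are joined. If that matters for your documents, mapping code 9 to a space instead of None is one possible adjustment.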


It works, but it’s too slow!


If we use the code above to convert 100 PDF files, we will find it unacceptably slow, since each PDF takes a long time to convert. Don't worry: we can use a pool of worker processes to convert several PDFs at the same time, which speeds up the whole batch.

import os
from concurrent.futures import ProcessPoolExecutor

# config is assumed to be a dict-like object loaded elsewhere
# (keys used here: max_worker, pdf_folder, word_folder)
tasks = []
with ProcessPoolExecutor(max_workers=int(config['max_worker'])) as executor:
    for file in os.listdir(config['pdf_folder']):
        extension_name = os.path.splitext(file)[1]
        if extension_name != '.pdf':
            continue
        file_name = os.path.splitext(file)[0]
        pdf_file = config['pdf_folder'] + '/' + file
        word_file = config['word_folder'] + '/' + file_name + '.docx'
        print('Processing: ', file)
        result = executor.submit(pdf_to_word, pdf_file, word_file)
        tasks.append(result)
# Poll until every submitted task has finished
while True:
    exit_flag = True
    for task in tasks:
        if not task.done():
            exit_flag = False
    if exit_flag:
        print('Done')
        exit(0)

We use the concurrent.futures package from the Python standard library to run the conversions in multiple processes. The pdf_to_word function wraps the logic above for reading the PDF and writing the Word file. The while loop at the end checks whether all tasks have finished.
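As an alternative to polling `done()` in a loop, `concurrent.futures.as_completed` blocks until the next task finishes and also makes it easy to collect failures. The following is a minimal sketch of that idea, not the project's actual code. `plan_jobs` and `run_batch` are hypothetical names, and `convert` is whatever function does the real work (for example `pdf_to_word`). The sketch defaults to `ThreadPoolExecutor` so it runs anywhere; for CPU-bound PDF parsing you would likely pass `ProcessPoolExecutor` instead, in which case `convert` must be a module-level function so it can be pickled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def plan_jobs(pdf_folder, word_folder):
    """Return (pdf_path, docx_path) pairs for every .pdf file in pdf_folder."""
    jobs = []
    for pdf_path in sorted(Path(pdf_folder).iterdir()):
        # Unlike the loop above, this also accepts upper-case extensions such as .PDF.
        if pdf_path.is_file() and pdf_path.suffix.lower() == ".pdf":
            jobs.append((pdf_path, Path(word_folder) / (pdf_path.stem + ".docx")))
    return jobs


def run_batch(jobs, convert, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Run convert(pdf_path, docx_path) for each job; return (succeeded, failed)."""
    succeeded, failed = [], []
    with executor_cls(max_workers=max_workers) as executor:
        futures = {executor.submit(convert, pdf, docx): pdf for pdf, docx in jobs}
        # as_completed yields each future as soon as it finishes, so no polling is needed.
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                future.result()  # re-raises any exception raised inside the worker
                succeeded.append(pdf)
            except Exception as error:
                failed.append((pdf, error))
    return succeeded, failed
```

A call might look like `run_batch(plan_jobs('pdf', 'word'), pdf_to_word, executor_cls=ProcessPoolExecutor)`, where the folder names are placeholders. Leaving the `with` block already waits for all submitted tasks, so no extra loop is required afterwards.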


The result

So far, we have implemented parallel batch conversion of PDF files to Word documents. Let's try it on a famous article. The result is shown in the picture (converted Word document on the left, PDF on the right):



I don't want to write code, I just want to use it

All of the code described in this article has been packaged into a standalone, runnable project and published on GitHub. If you don't want to write the code yourself, you can clone or download the GitHub project. The project address is as follows (remember to star it):

simpleapples/pdf2word (github.com)

