In our daily work and study, we often need to turn the text in a PDF into a Word document, that is, to go from a read-only format to an editable one. Most of us reach for online tools in this situation, but they vary widely in quality and often fail to meet our needs.
Today, I'll show you how to convert PDF content into a Word document using Python. Along the way, we will also extract the images in the PDF and save them to a folder of our choice.
01. Text extraction
The first thing we need to do is extract the Chinese text from the PDF, as shown in the figure below:
The text in a PDF is read-only and cannot be edited directly. So we need to extract the text from the PDF and then write it into a Word file, where it can be edited afterwards. For text extraction we use the PDFMiner library; the main functions are shown in the following figure:
- The program first calls the get_content_from_PDF function, which returns the data extracted from the PDF.
- A PDFResourceManager object is created to hold shared resources, a PDFPageAggregator object is created to turn the interpreted content into the desired format, and a PDFPageInterpreter object is created to process the page content.
- page_index tells the program which pages to extract. For each page to be extracted, the PDFPageInterpreter object interprets the page content.
- Finally, the PDFPageAggregator object is used to retrieve the processed page data.
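The steps in the list above can be sketched roughly as follows. This is a minimal sketch, assuming the pdfminer.six package and 0-based page numbers; get_content_from_PDF and page_index are the names used in the text, while the helper text_from_layout and the overall structure are assumptions made for illustration, not the original listing.

```python
def text_from_layout(layout):
    """Join the text of every layout element that carries text.

    Hypothetical helper: elements without a get_text() method
    (images, lines, rectangles) are skipped.
    """
    parts = []
    for element in layout:
        get_text = getattr(element, "get_text", None)
        if callable(get_text):
            parts.append(get_text())
    return "".join(parts)


def get_content_from_PDF(pdf_path, page_index):
    """Return the text of the pages whose 0-based number is in page_index."""
    # pdfminer.six is imported here so text_from_layout stays usable without it.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()  # holds shared resources
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)

    wanted = set(page_index)
    content = []
    with open(pdf_path, "rb") as fp:
        for number, page in enumerate(PDFPage.get_pages(fp)):
            if number not in wanted:
                continue  # skip pages we were not asked to extract
            interpreter.process_page(page)  # interpret the page content
            layout = device.get_result()    # parsed objects of this page
            content.append(text_from_layout(layout))
    return "".join(content)
```

Under these assumptions, a call such as get_content_from_PDF("example.pdf", [0]) would return the text of the first page only; the file name is a placeholder.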
The layout here contains the various objects parsed from the page, including text, images and other information. However, I found that PDFMiner did not work well for image extraction, so I handled images separately with the fitz library (PyMuPDF), which gave good results. With that said, let's take a look at the results of the text processing.
Our PDF is a two-page document, and we ask the program to extract only the text of the first page. As can be seen from the figure above, the program extracts the text of the first page completely, without any errors.
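To get from the extracted text to an editable Word file, as described at the start of this section, something like the following could be used. It is only a sketch: the article does not say which library writes the Word file, so python-docx is an assumption here, and the names split_paragraphs and save_text_to_word are hypothetical.

```python
def split_paragraphs(text):
    """Split extracted text into non-empty, stripped lines (one per paragraph)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def save_text_to_word(text, docx_path):
    """Write each line of text as a paragraph in a new Word document."""
    # python-docx is an assumed dependency; it is imported here so that
    # split_paragraphs can be used without it.
    from docx import Document

    document = Document()
    for paragraph in split_paragraphs(text):
        document.add_paragraph(paragraph)
    document.save(docx_path)
```

Because extracted PDF text usually breaks wherever the page layout breaks, this sketch turns every extracted line into its own Word paragraph; merging lines back into full paragraphs would need extra logic that is not shown.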
02. Image extraction
With text processing behind us, let's look at how to extract images from the PDF and save them locally. The image extraction program is as follows:
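A minimal sketch of this kind of program, assuming a recent PyMuPDF release (imported as fitz). It scans every object in the file, uses a regular-expression match on each object's source text to keep only images, and saves each one through a Pixmap. The names is_image_object and extract_images and the image_N.png file naming are hypothetical, chosen for illustration rather than taken from the original listing.

```python
import os
import re

# The source text of an image object contains "/Type /XObject" and "/Subtype /Image".
XOBJECT_RE = re.compile(r"/Type\s*/XObject")
IMAGE_RE = re.compile(r"/Subtype\s*/Image")


def is_image_object(object_text):
    """Return True if a PDF object's source text describes an image XObject."""
    return bool(XOBJECT_RE.search(object_text) and IMAGE_RE.search(object_text))


def extract_images(pdf_path, output_dir):
    """Save every image object found in pdf_path as a PNG; return the count."""
    import fitz  # PyMuPDF; imported here so is_image_object works without it

    os.makedirs(output_dir, exist_ok=True)
    count = 0
    with fitz.open(pdf_path) as doc:
        # xref 0 is reserved, so real objects start at 1.
        for xref in range(1, doc.xref_length()):
            if not is_image_object(doc.xref_object(xref)):
                continue  # not an image: skip it
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:
                # CMYK data cannot be written as PNG, so convert to RGB first.
                pix = fitz.Pixmap(fitz.csRGB, pix)
            count += 1
            pix.save(os.path.join(output_dir, f"image_{count}.png"))
    return count
```

The matching step can be checked on its own: is_image_object("<< /Type /XObject /Subtype /Image >>") is true, while a page object such as "<< /Type /Page >>" is rejected.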
In the program above, we use the fitz library to walk through the objects in the PDF document and use string matching to determine whether each object is an image. If it is not, we simply skip it.
If the object is an image, we create a Pixmap object from it and save the image to the path we specify. The results are shown below:
As can be seen from the figure above, the images were extracted correctly, which is what we set out to do. I also tried a PDF with multiple images and it coped without trouble, extracting all the images from the document in just a few seconds.
That wraps up this walkthrough of PDF-to-Word extraction. We extracted not only the Chinese text of the PDF document but also its images, which eases the workload and improves efficiency. Go ahead and download the source code and try it out.