This article shows you how to use Python to extract the text of many PDF files in bulk and store it in a pandas DataFrame for subsequent data analysis.
The problem
Recently, the messages readers send me have become more varied.
After I published several articles on natural language processing, one request kept growing louder:
Teacher, is there a convenient way to extract the text content of PDF files?
I can understand how these readers feel.
In the examples I’ve shown, text data can be read directly into a DataFrame for processing. It may come from open data sets, web APIs, or crawlers.
Sometimes, however, you run into data locked in a particular format.
PDF, for example.
Many academic papers, research reports, and even shared data sets are published in this format.
By now you have mastered quite a few natural language analysis tools, so this is frustrating, like drawing your sword and finding nothing to strike: you know exactly how to process the text, yet a simple format conversion stands in your way.
What can you do?
There are options: dedicated software, online conversion services, and even manual copy and paste.
But we value efficiency, right?
Some of these methods send a large amount of content over the Internet, which takes time and may raise security and privacy concerns. Some cost money. Some are simply impractical.
So what can you do?
The good news is that Python can extract PDF text in bulk quickly and efficiently, and it connects seamlessly with data analysis tools, laying the foundation for subsequent analysis.
This article shows you the process in detail.
Would you like to try it?
Data
To better illustrate the process, I have prepared a zip package for you.
It contains the code for this tutorial, along with the data we will use.
Please go to this website to download the tutorial package.
Download it and unzip it, and you’ll see the following contents in the generated directory (hereinafter referred to as the “demo directory”).
The demo directory contains:
- Pipfile: the pipenv configuration file, used to prepare the dependency packages we need. How to use it is explained below;
- pdf_extractor.py: a helper function built on pdfminer.six. It lets you call PDFMiner’s PDF text extraction functionality without having to worry about a pile of annoying parameters;
- demo.ipynb: the Python source code for this tutorial (Jupyter Notebook format), already written for you.
In addition, two folders are included in the demo directory.
Inside these two folders are Chinese PDF files, used to demonstrate PDF content extraction. They are all papers I published in Chinese core journals a few years ago.
Two points are worth noting:
- I use my own papers as examples because extracting text from other people’s papers might lead to intellectual property disputes with the authors and database operators.
- There are two folders so that I can show you what the extraction tool does when new PDF files are added.
The contents of the pdf folder are as follows:
The newpdf folder contains the following:
With the data ready, let’s deploy the code runtime environment.
The environment
The easiest way to install Python is to install the Anaconda distribution.
Please download the latest version of Anaconda at this website.
Choose the Python 3.6 version on the left to download and install.
If you need a step-by-step guide, or want to know how to install and run Anaconda on Windows, check out this video tutorial I’ve prepared for you.
After installing Anaconda, open a terminal and use the cd command to go to the demo directory.
If you don’t know how to use it, you can also refer to the video tutorial.
We need to install some environment dependency packages.
First execute:
pip install pipenv
This installs pipenv, an excellent Python package management tool. Once it is installed, run:
pipenv install --skip-lock
The Pipenv tool will automatically install all the dependency packages we need according to the Pipfile.
The terminal shows a progress bar with the number of packages to install and the current progress.
When installation finishes, run the following as prompted:
pipenv shell
This brings us to the virtual runtime environment exclusive to this tutorial.
Be sure to run the following command:
python -m ipykernel install --user --name=py36
Only then is the current Python environment registered with the system as a kernel, named py36.
Make sure you have Google Chrome installed on your computer.
Then run:
jupyter notebook
Your default browser (Google Chrome) will open and show the Jupyter Notebook interface:
You can click on the ipynb file, the first item in the list of files, to see the entire sample code for this tutorial.
You can execute the code in turn while watching the tutorial.
What I suggest, however, is to go back to the main screen and create a new blank Python 3 notebook (choose the kernel shown with the name py36).
Please follow the tutorial and enter the corresponding content character by character. This will help you understand the code more deeply and internalize your skills more effectively.
When you have trouble writing code, refer back to the demo.ipynb file.
With all the preparation done, let’s start typing the code.
Code
First, we read in some modules for file manipulation.
import glob
import os
As mentioned earlier, the demo directory contains two folders, pdf and newpdf.
We set the PDF file path to the pdf folder.
pdf_path = "pdf/"
We want to get the path to all PDF files. With glob, a single command can do this.
pdfs = glob.glob("{}/*.pdf".format(pdf_path))
Let’s see if we got the correct PDF file path.
pdfs
['pdf/Research on the Diffusion Model of Fake Information in Microblog for Complex System Simulation.pdf', 'pdf/Social Media Competitive Intelligence Gathering for Shadow Analysis.pdf', 'pdf/Analysis of Mobile Internet Government Portal for Human-machine Collaboration.pdf']
Verified. Accurate.
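As a quick check of how the pattern behaves, here is a minimal, self-contained sketch. It builds a throwaway directory (the file names are made up for illustration) and lists the PDF files in it. Two details are worth knowing: glob.glob returns paths in arbitrary order, so sorting makes the result reproducible, and whether "*.pdf" also matches ".PDF" depends on the file system, so the sketch compares a lowercased extension instead.

```python
import glob
import os
import tempfile


def list_pdfs(folder):
    """Return the PDF paths directly inside `folder`, sorted by name."""
    candidates = glob.glob(os.path.join(folder, "*"))
    # Compare the lowercased extension so that report.PDF is found too.
    return sorted(p for p in candidates
                  if os.path.isfile(p) and p.lower().endswith(".pdf"))


# Hypothetical demo files, created only for this illustration.
demo_dir = tempfile.mkdtemp()
for name in ["b.pdf", "a.PDF", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

found = [os.path.basename(p) for p in list_pdfs(demo_dir)]
print(found)
```

For the three sample papers in this tutorial the plain "*.pdf" pattern is enough; the extra filtering only matters when file names come from mixed sources.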
Here we use PDFMiner to extract content from PDF files. We need to import the function extract_pdf_content from the auxiliary Python file pdf_extractor.py.
from pdf_extractor import extract_pdf_content
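The contents of pdf_extractor.py are not reproduced in this article. As a rough idea of what such a wrapper could look like, here is a minimal sketch that assumes a recent release of pdfminer.six, whose pdfminer.high_level.extract_text function returns the text of a PDF as one string. The name extract_text_sketch is made up for illustration; it is not the helper shipped with the tutorial package, which may work differently.

```python
import os


def extract_text_sketch(pdf_file):
    """Hypothetical stand-in for a helper such as extract_pdf_content."""
    if not os.path.isfile(pdf_file):
        raise FileNotFoundError(pdf_file)
    # Imported lazily so the path check above works even when
    # pdfminer.six is not installed (assumption: a recent release
    # that provides the high-level API).
    from pdfminer.high_level import extract_text
    return extract_text(pdf_file)
```

Checking the path first gives a clear error for a mistyped file name instead of a failure deep inside the parser.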
With this function, we try to extract the content of the first file in the PDF file list and store the text in the content variable.
content = extract_pdf_content(pdfs[0])
Let’s take a look at what’s in content:
content
Clearly, the content extraction is not perfect: headers and footers are mixed into the text.
However, for many of our textual analysis purposes, this is irrelevant.
You’ll see a lot of \n in the content. What is that?
We use the print function to display the content.
print(content)
As you can clearly see, those \n are newlines.
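The difference between the two displays can be reproduced with any string. The sketch below uses a made-up sample rather than real extracted text: evaluating a string in a notebook cell shows its repr, where a newline appears as the two characters \n, while print interprets the escape and starts a new line.

```python
sample = "Title\nFirst line of the abstract\nsecond line"  # made-up text

# A notebook cell that ends with a bare string shows its repr,
# in which each newline is written as the escape sequence \n.
print(repr(sample))

# print() interprets the escape, so each \n starts a new line.
print(sample)

# Optional cleanup: collapse every run of whitespace into one space.
flattened = " ".join(sample.split())
print(flattened)
```

For Chinese text, where words are not separated by spaces, joining with an empty string ("".join(sample.split())) may be the better cleanup, at the cost of also removing spaces inside English phrases.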
We built confidence through an extraction test of PDF files.
Now, it’s time to set up a dictionary to extract and store content in bulk.
mydict = {}
We iterate through the pdfs list, using the file name (without the directory) as the key. This way, we can easily tell which PDF files have already been extracted and which have not.
To make the process clearer, let’s have Python print the PDF file name being extracted.
for pdf in pdfs:
    # Use the file name without its directory as the key;
    # os.path.basename also handles Windows path separators.
    key = os.path.basename(pdf)
    if key not in mydict:
        print("Extracting content from {} ...".format(pdf))
        mydict[key] = extract_pdf_content(pdf)
During extraction, you will see the following output:
Extracting content from pdf/Research on the Diffusion Model of Fake Information in Microblog for Complex System Simulation.pdf ...
Extracting content from pdf/Social Media Competitive Intelligence Gathering for Shadow Analysis.pdf ...
Extracting content from pdf/Analysis of Mobile Internet Government Portal for Human-machine Collaboration.pdf ...
Look at the keys in the dictionary at this point:
mydict.keys()
dict_keys(['Research on the Diffusion Model of Fake Information in Microblog for Complex System Simulation.pdf', 'Social Media Competitive Intelligence Gathering for Shadow Analysis.pdf', 'Analysis of Mobile Internet Government Portal for Human-machine Collaboration.pdf'])
Everything is normal.
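The keys are the file names with the directory removed. How that removal is done matters if the notebook runs on Windows, where glob returns paths that use backslashes. The sketch below compares splitting on "/" with basename, using two hypothetical paths. It calls posixpath and ntpath directly so that the result is the same on every platform; os.path is simply whichever of the two matches the system you are running on.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # Linux and macOS path rules

posix_style = "pdf/report.pdf"       # hypothetical paths
windows_style = "pdf\\report.pdf"

# Splitting on "/" only works when "/" is the separator in the path.
split_posix = posix_style.split("/")[-1]
split_windows = windows_style.split("/")[-1]
print(split_posix)     # file name only
print(split_windows)   # the whole path, directory included

# basename applies the separator rules of the chosen platform.
print(posixpath.basename(posix_style))
print(ntpath.basename(windows_style))
```

With the backslash path, the split leaves the directory in the key, so the same file could end up stored under different keys on different machines.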
We use pandas to turn the dictionary into a DataFrame for analysis.
import pandas as pd
The following statement converts the dictionary into a DataFrame. Note that the trailing reset_index() turns the index generated from the dictionary keys into a regular column.
df = pd.DataFrame.from_dict(mydict, orient='index').reset_index()
We then rename the column for later use.
df.columns = ["path", "content"]
The DataFrame looks like this:
df
As you can see, our DataFrame holds the PDF file names and all the text content. This lets you apply keyword extraction, sentiment analysis, similarity calculation, and many other analysis tools.
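To see the dictionary-to-table conversion in isolation, here is a small sketch with a made-up two-entry dictionary standing in for the extracted texts. It builds the table the way this tutorial does and, for comparison, directly from the (key, value) pairs; under the assumption of a reasonably recent pandas, both routes give the same two columns.

```python
import pandas as pd

# Made-up stand-in for the dictionary of extracted texts.
texts = {"a.pdf": "first text", "b.pdf": "second, longer text"}

# Route used in this tutorial: the keys become the index,
# and reset_index() then turns that index into a regular column.
df_a = pd.DataFrame.from_dict(texts, orient="index").reset_index()
df_a.columns = ["path", "content"]

# Alternative: build the rows directly from the (key, value) pairs.
df_b = pd.DataFrame(list(texts.items()), columns=["path", "content"])

print(df_a)
print(df_b)
```

The second route names the columns in one step, which some readers may find easier to follow; the first keeps closer to the idea that each dictionary key identifies one row.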
Due to space constraints, we will only use a character count example to show the basic analysis capabilities.
We ask Python to count the length of the extracted content.
df["length"] = df.content.apply(lambda x: len(x))
The DataFrame now looks like this:
df
The extra column is the number of characters in the PDF text content.
To display the drawing results correctly in Jupyter Notebook, we need to use the following statement:
%matplotlib inline
Next, we ask pandas to draw a bar chart of the character length column. To make the chart look good, we set the figure’s width and height, and tilt the PDF file names by 45 degrees.
import matplotlib.pyplot as plt
plt.figure(figsize=(14, 6))  # width and height in inches
df.set_index('path').length.plot(kind='bar')
plt.xticks(rotation=45)
Visual analysis completed.
Let’s break the analysis process down into functions that are easier to call in the future.
First, we wrap the step that extracts PDF content into the dictionary:
def get_mydict_from_pdf_path(mydict, pdf_path):
    # Find every PDF in the folder and extract only the new ones.
    pdfs = glob.glob("{}/*.pdf".format(pdf_path))
    for pdf in pdfs:
        key = os.path.basename(pdf)  # file name without the directory
        if key not in mydict:
            print("Extracting content from {} ...".format(pdf))
            mydict[key] = extract_pdf_content(pdf)
    return mydict
The inputs are the existing dictionary and the PDF folder path. The output is the updated dictionary.
You may wonder why the function needs an “existing dictionary” at all. Don’t worry, I’ll show you a practical example later.
The following function is pretty straightforward: it converts the dictionary into a DataFrame.
def make_df_from_mydict(mydict):
    # Keys become the index; reset_index() turns it into a column.
    df = pd.DataFrame.from_dict(mydict, orient='index').reset_index()
    df.columns = ["path", "content"]
    return df
The final function plots the number of characters counted.
def draw_df(df):
    # Add a character-count column, then plot it as a bar chart.
    df["length"] = df.content.apply(lambda x: len(x))
    plt.figure(figsize=(14, 6))
    df.set_index('path').length.plot(kind='bar')
    plt.xticks(rotation=45)
Now that the functions are written, let’s try them out.
Remember that the demo directory has a subdirectory called newpdf?
Let’s move two of its PDF files into the pdf directory.
The pdf directory now holds five files:
We execute the three newly organized functions.
First, pass in the existing dictionary (note that it already holds 3 records). The PDF folder path is unchanged. The output is the updated dictionary.
mydict = get_mydict_from_pdf_path(mydict, pdf_path)
Extracting content from pdf/Corporate Competitive Intelligence Collection of Twitter.pdf ...
Extracting content from pdf/Research on the Protection Measures of Mobile Social Media User Privacy.pdf ...
Note that the original 3 PDF files are not extracted again; only the 2 new PDF files are extracted.
There are only five files in total, so you may not see a significant difference intuitively.
But let’s say you’ve spent hours extracting hundreds of PDFs and your boss throws you three new PDFs…
If you had to extract information from scratch, it would be overwhelming.
At this point, using our function, you can append the contents of the new file in less than 1 minute.
That’s a big difference, isn’t it?
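The saving comes entirely from the membership check on the dictionary. The sketch below makes that visible without any real PDF: fake_extract is a made-up stand-in for the slow extraction step that records every file it is asked to process, and update_cache is a simplified, hypothetical version of the function above.

```python
import os
import tempfile

calls = []  # every file the stand-in extractor is asked to read


def fake_extract(path):
    """Stand-in for a slow extractor; returns dummy text."""
    calls.append(os.path.basename(path))
    return "text of " + os.path.basename(path)


def update_cache(cache, folder, extract=fake_extract):
    """Extract only the PDFs in `folder` that are not yet in `cache`."""
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".pdf") and name not in cache:
            cache[name] = extract(os.path.join(folder, name))
    return cache


folder = tempfile.mkdtemp()
for name in ["a.pdf", "b.pdf", "c.pdf"]:
    open(os.path.join(folder, name), "w").close()

cache = update_cache({}, folder)      # first pass: three extractions
open(os.path.join(folder, "d.pdf"), "w").close()
cache = update_cache(cache, folder)   # second pass: only d.pdf is new

print(calls)
```

Each file name appears in calls exactly once, even though the folder was scanned twice. Note that the dictionary lives only in memory here, as it does in the tutorial, so the benefit lasts for the current session.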
So let’s build our DataFrame from the new dictionary.
df = make_df_from_mydict(mydict)
We plot the number of characters in the text extracted from each PDF in the new DataFrame. The result is as follows:
draw_df(df)
That completes the code walkthrough.
Summary
To sum up, this article covered the following points:
- How to use glob to collect the paths of all files of a given format in a directory;
- How to use PDFMiner to extract text from PDF files;
- How to build a dictionary that stores content under keys (file names in this case) and avoids reprocessing data;
- How to convert a dictionary into a pandas DataFrame for subsequent data analysis;
- How to draw a bar chart of summary statistics using the plotting functions of Matplotlib and pandas.
Discussion
In your previous data analysis work, have you ever had to extract text from PDF files? How did you handle it? Do you know of better tools and methods? You are welcome to leave a comment and share your experience and thoughts so that we can discuss them together.
If you liked this article, please give it a thumbs up. You can also follow and pin my official account “Nkwangshuyi” on WeChat.
If you’re interested in data science, check out my tutorial index post, How to Get Started in Data Science Effectively. It covers more interesting problems and solutions.