Preface
pdfplumber is an open-source Python library for extracting information from PDF files, such as text content, headings, tables, and page sizes.
Installation
First, install the pdfplumber module by running the following command.
pip install pdfplumber
Alternatively, install it from the Douban mirror.
pip install -i https://pypi.douban.com/simple pdfplumber
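To confirm that the installation worked, one option is to ask Python which versions it can see. The sketch below uses the standard-library importlib.metadata module. The helper name installed_version is made up for this illustration, and openpyxl is listed on the assumption that the Excel export later in this article goes through the default .xlsx engine of pandas.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Packages used in this article; openpyxl is an assumption (see the note above)
for name in ("pdfplumber", "pandas", "openpyxl"):
    print(name, installed_version(name) or "not installed")
```

If any of the three is reported as not installed, it can be added with pip in the same way as pdfplumber.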
Example
The example file is the list of winning entries of the 2020 Chinese College Students Computer Design Competition. The file is in PDF format and has 158 pages in total; each page contains a table with the award information for each team. Below, the tables in the PDF are extracted and saved to Excel.
Import the required modules first:
import pdfplumber
import pandas as pd
Read the PDF file:
read_path = '2020 Chinese College Students Computer Design Competition Winners list.pdf'
pdf_2020 = pdfplumber.open(read_path)
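Before extracting anything, it can help to check what was opened. The following sketch relies on the pages list of an opened PDF and on the page_number, width and height attributes of each page in pdfplumber. The helpers summarize_pages and summarize_pdf are hypothetical names chosen for this illustration. pdfplumber is imported inside summarize_pdf so that the first helper can be tried without a PDF at hand.

```python
def summarize_pages(pages):
    """Return (page_number, width, height) for each page-like object."""
    return [
        (page.page_number, float(page.width), float(page.height))
        for page in pages
    ]


def summarize_pdf(path):
    """Open a PDF with pdfplumber and summarize every page."""
    import pdfplumber  # imported here so summarize_pages works without it

    # The with-block closes the file once the summary has been built
    with pdfplumber.open(path) as pdf:
        return summarize_pages(pdf.pages)
```

Calling summarize_pdf(read_path) on the competition file should return one tuple per page, so the length of the result can be compared with the expected 158 pages.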
The pages attribute holds one entry per page of the PDF. Loop over the pages, extract the table on each page with the extract_table() method, convert it to a DataFrame, and finally merge the data from all pages.
result_df = pd.DataFrame()
for page in pdf_2020.pages:
    table = page.extract_table()
    # The first row of each page's table is used as the header
    df_detail = pd.DataFrame(table[1:], columns=table[0])
    # Append each page's data so the rows stay in page order
    result_df = pd.concat([result_df, df_detail], ignore_index=True)
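Two things can trip up this kind of loop on other PDFs: extract_table() returns None for a page on which no table is found, and a header row may or may not repeat on each page. The sketch below shows one way to handle both on plain lists of rows, which is the shape extract_table() returns. The helper merge_page_tables and the sample rows are made up for illustration and are not taken from the competition file.

```python
def merge_page_tables(tables):
    """Merge per-page tables (lists of rows) into one header and one row list."""
    header = None
    rows = []
    for table in tables:
        if not table:
            # No table was detected on this page (None or an empty list)
            continue
        if header is None:
            # Treat the first row of the first table as the header
            header = table[0]
        # Skip the first row only when it repeats the header
        start = 1 if table[0] == header else 0
        rows.extend(table[start:])
    return header, rows


# Stand-ins for the values returned by page.extract_table() on four pages
sample_tables = [
    [["award", "school"], ["First Prize", "School A"]],
    None,
    [["award", "school"], ["Second Prize", "School B"]],
    [["Third Prize", "School C"]],
]
header, rows = merge_page_tables(sample_tables)
print(header)
print(rows)
```

The merged result can then be handed to pandas in a single step, for example pd.DataFrame(rows, columns=header), which also avoids concatenating inside the loop.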
Inspecting the resulting DataFrame shows that the data extracted by extract_table() contains many columns of missing values. The DataFrame needs further processing to drop the columns in which every value is missing.
result_df.dropna(axis=1, how='all', inplace=True)
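The how argument decides which columns are dropped. As a small illustration on made-up data (the column names and values below are not from the competition file), how='all' removes only the columns that are empty throughout, while how='any' removes every column that has at least one missing value:

```python
import pandas as pd

# Made-up data: one fully empty column and one partly empty column
df = pd.DataFrame({
    "award": ["First Prize", "Second Prize"],
    "empty": [None, None],
    "advisor": ["Li", None],
})

# Drop columns in which every value is missing
all_missing_dropped = df.dropna(axis=1, how="all")
# Drop columns in which at least one value is missing
any_missing_dropped = df.dropna(axis=1, how="any")

print(list(all_missing_dropped.columns))  # ['award', 'advisor']
print(list(any_missing_dropped.columns))  # ['award']
```

For the winners list, how='all' is the safer choice, since a real column such as the advisor may legitimately be blank in some rows.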
Dropping the empty columns also loses the column names, so the column names need to be assigned explicitly.
result_df.columns = ['award', 'Work No.', 'Title of Work', 'Participating Schools', 'the writer', 'Advisor']
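Assigning a list to result_df.columns only works when the list has exactly as many names as the DataFrame has columns; otherwise pandas raises a ValueError. A minimal sketch of a guard that reports the mismatch more clearly is shown below. The helper with_column_names and the sample rows are hypothetical and only serve the illustration.

```python
import pandas as pd


def with_column_names(df, names):
    """Return a copy of df with new column names, checking the count first."""
    if len(names) != df.shape[1]:
        raise ValueError(
            f"expected {df.shape[1]} column names, got {len(names)}"
        )
    renamed = df.copy()
    renamed.columns = list(names)
    return renamed


# Made-up rows; pandas labels the columns 0, 1, 2 by default
sample = pd.DataFrame([
    ["First Prize", "001", "Demo"],
    ["Second Prize", "002", "Demo 2"],
])
named = with_column_names(sample, ["award", "Work No.", "Title of Work"])
print(list(named.columns))
```

Checking result_df.shape[1] after the dropna step is a quick way to see whether six columns really remain before the six names are assigned.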
So far we have successfully extracted the complete table information!
Complete code
import pdfplumber
import pandas as pd


def read_pdf(read_path, save_path):
    result_df = pd.DataFrame()
    # The with-block closes the PDF once every page has been read
    with pdfplumber.open(read_path) as pdf_2020:
        for page in pdf_2020.pages:
            table = page.extract_table()
            if not table:
                continue  # no table was found on this page
            # The first row of each page's table is used as the header
            df_detail = pd.DataFrame(table[1:], columns=table[0])
            # Append each page's data so the rows stay in page order
            result_df = pd.concat([result_df, df_detail], ignore_index=True)
    # Drop the columns in which every value is missing
    result_df.dropna(axis=1, how='all', inplace=True)
    result_df.columns = ['award', 'Work No.', 'Title of Work', 'Participating Schools', 'the writer', 'Advisor']
    # to_excel no longer accepts an encoding argument in pandas 2.x
    result_df.to_excel(excel_writer=save_path, index=False)


read_path = r'2020 Chinese College Students Computer Design Contest Winners list.pdf'
save_path = r'2020 Chinese College Students Computer Design Contest Winners list.xlsx'
read_pdf(read_path, save_path)
That is all I wanted to share today.