A step-by-step guide to extracting tables in PDF using Python

Preface

pdfplumber is an open-source Python library for extracting information from PDF files, including text content, tables, and page sizes.

Installation

First, install the pdfplumber module by running the following command.

pip install pdfplumber

Alternatively, install it from the Douban PyPI mirror.

pip install -i https://pypi.douban.com/simple pdfplumber
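
To confirm that the installation succeeded, one option is to look up the installed version with the standard library's importlib.metadata. This is only a sketch; the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if pdfplumber is installed, otherwise None
print(installed_version("pdfplumber"))
```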

Example

The example uses the list of winning entries from the 2020 Chinese College Students Computer Design Competition. The file is in PDF format and has 158 pages in total; each page contains a table with the award information for each team. Below, we extract the tables from the PDF and save them to Excel.

Import the required modules first:

import pdfplumber
import pandas as pd

Next, open the PDF file:

read_path = '2020 Chinese College Students Computer Design Competition Winners list.pdf'
pdf_2020 = pdfplumber.open(read_path)

The pages attribute holds one page object for each page in the PDF. We loop through the pages, extract the table on each page with the extract_table() method, convert it to a DataFrame, and finally merge the data from all pages.

result_df = pd.DataFrame()
for page in pdf_2020.pages:
    table = page.extract_table()
    df_detail = pd.DataFrame(table[1:], columns=table[0])
    # Append this page's data after the pages already merged, keeping page order
    result_df = pd.concat([result_df, df_detail], ignore_index=True)
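
For reference, extract_table() returns the table as a list of rows, where each row is a list of cell values (strings, or None for an empty cell). The sketch below uses a small hand-written table in that shape to show how the first row becomes the header and the remaining rows become the data. The rows are invented for illustration and are not taken from the PDF.

```python
import pandas as pd

# Hypothetical sample shaped like the return value of page.extract_table():
# a list of rows, each row a list of cell strings (None for an empty cell).
table = [
    ["Award", "Work No.", "Advisor"],
    ["First Prize", "10001", None],
    ["Second Prize", "10002", "Teacher A"],
]

# First row -> column names, remaining rows -> data
df_detail = pd.DataFrame(table[1:], columns=table[0])

print(df_detail.shape)                       # (2, 3)
print(df_detail["Advisor"].isna().tolist())  # [True, False]
```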

Inspecting the resulting DataFrame shows that the data extracted by extract_table() contains many columns made up entirely of missing values, so we process the DataFrame further to drop every column in which all values are missing.

result_df.dropna(axis=1, how='all', inplace=True)
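
The how argument decides which columns dropna() removes: how='all' drops a column only when every value in it is missing, while how='any' drops a column as soon as a single value is missing. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up frame: column "b" is entirely missing, column "c" only partly
df = pd.DataFrame({
    "a": ["x", "y"],
    "b": [None, None],
    "c": ["z", None],
})

only_empty_removed = df.dropna(axis=1, how="all")
any_missing_removed = df.dropna(axis=1, how="any")

print(list(only_empty_removed.columns))   # ['a', 'c']
print(list(any_missing_removed.columns))  # ['a']
```

For the competition table, how='all' is the safer choice, because a column such as the advisor may legitimately be empty for some rows.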

Dropping the missing values also loses the column names, so we need to assign the column names explicitly.

result_df.columns = ['award', 'Work No.', 'Title of Work', 'Participating Schools', 'the writer', 'Advisor']
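
Assigning to columns raises a ValueError when the number of names does not match the number of columns, which can happen if dropna() removed more or fewer columns than expected. One way to make that failure easier to diagnose is to check the count first. The helper rename_columns below is hypothetical and not part of pandas.

```python
import pandas as pd


def rename_columns(df, names):
    """Return a copy of df with its columns replaced by names, after a length check."""
    if len(names) != df.shape[1]:
        raise ValueError(
            f"expected {df.shape[1]} column names, got {len(names)}"
        )
    renamed = df.copy()
    renamed.columns = names
    return renamed


# Made-up two-column frame standing in for the cleaned result
sample = pd.DataFrame([["First Prize", "10001"]])
print(list(rename_columns(sample, ["award", "Work No."]).columns))
# ['award', 'Work No.']
```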

At this point, we have successfully extracted the complete table!

Complete code

import pdfplumber
import pandas as pd

def read_pdf(read_path, save_path):
    result_df = pd.DataFrame()
    # The context manager closes the PDF once all pages have been read
    with pdfplumber.open(read_path) as pdf_2020:
        for page in pdf_2020.pages:
            table = page.extract_table()
            # Use the first row as the header and the remaining rows as data
            df_detail = pd.DataFrame(table[1:], columns=table[0])
            # Append this page's data after the pages already merged
            result_df = pd.concat([result_df, df_detail], ignore_index=True)
    # Drop columns in which every value is missing
    result_df.dropna(axis=1, how='all', inplace=True)
    result_df.columns = ['award', 'Work No.', 'Title of Work', 'Participating Schools', 'the writer', 'Advisor']
    # The encoding argument is omitted: pandas 2.0 and later no longer accept it in to_excel()
    result_df.to_excel(excel_writer=save_path, index=False)

read_path = r'2020 Chinese College Students Computer Design Contest Winners list.pdf'
save_path = r'2020 Chinese College Students Computer Design Contest Winners list.xlsx'
read_pdf(read_path, save_path)
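
The complete code assumes that every page contains a table. extract_table() returns None when it finds no table on a page, in which case table[1:] would fail. A defensive variant might skip such pages. The sketch below is one way to do that; collect_tables is a hypothetical helper that accepts any object with a pages attribute, such as the PDF object returned by pdfplumber.open().

```python
import pandas as pd


def collect_tables(pdf):
    """Merge the table from each page of an opened PDF, skipping pages without one."""
    frames = []
    for page in pdf.pages:
        table = page.extract_table()
        if not table:  # None or empty: no table detected on this page
            continue
        # First row -> column names, remaining rows -> data
        frames.append(pd.DataFrame(table[1:], columns=table[0]))
    if not frames:
        return pd.DataFrame()  # nothing found in the whole document
    # Concatenate once, in page order
    return pd.concat(frames, ignore_index=True)
```

With pdfplumber it could be called inside a with pdfplumber.open(read_path) as pdf: block as result_df = collect_tables(pdf). Collecting the per-page frames in a list and concatenating once also avoids rebuilding the merged DataFrame on each of the 158 pages.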
