PDF is an oddball format. There are many libraries for processing it, but none of them is perfect.

First, pdfminer3k

pdfminer3k is the Python 3 port of pdfminer, used primarily to extract text from PDFs.

There are plenty of pdfminer3k code examples on the web, and after looking at them, I can only poke fun at how complex they are; they fly in the face of Python's simplicity.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

path = "test.pdf"

# Create a PDF parser from a file object
parser = PDFParser(open(path, 'rb'))
# Create a PDF document object
doc = PDFDocument()
# Connect the parser and the document object
parser.set_document(doc)
doc.set_parser(parser)

# Provide the initial password;
# pass nothing if there is no password
doc.initialize()

# check whether the document provides TXT conversion, if not, ignore
if not doc.is_extractable:
    raise PDFTextExtractionNotAllowed
else:
    # Create a PDF resource manager to handle shared resources
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Loop through the list one page at a time
    for page in doc.get_pages():
        interpreter.process_page(page)                        
        # Receive the LTPage object for this page
        layout = device.get_result()
        # Here is a LTPage object, which stores the various objects parsed by this page
        # including LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
        for x in layout:
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())

pdfminer's handling of tables is very unfriendly: it can extract the text, but none of the structure:

Screenshot of the PDF table:

Output of running the code:

It is not easy to restore this output to a table, and piling on rules to do so inevitably costs generality.

Second, tabula-py

Tabula is purpose-built for extracting PDF table data and supports exporting PDFs to CSV and Excel, but the tool itself is written in Java and depends on Java 7/8. tabula-py is a Python wrapper around it, so it carries the same Java 7/8 dependency.

The code is simple:

import os
import tabula

path = 'test.pdf'

# Read all tables in the PDF into a DataFrame
df = tabula.read_pdf(path, encoding='gbk', pages='all')
for index in df.index:
    print(df.loc[index].values)

# Or export the tables straight to CSV:
# tabula.convert_into(path, os.path.splitext(path)[0] + '.csv', pages='all')

Although it claims to specialize in PDF tables, the actual results are not great. Running it on the same PDF used with pdfminer gives the following:

This result is frankly embarrassing: the table header is recognized incorrectly, and although there are two tables in the PDF, I could not find a way to tell them apart.
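For what it's worth, tabula-py's read_pdf does accept a multiple_tables=True argument that returns a list of DataFrames, one per detected table; whether its detection would actually separate these two tables correctly is another question. A minimal sketch, using the same test file as above:

import tabula

path = 'test.pdf'

# Ask tabula for one DataFrame per detected table
tables = tabula.read_pdf(path, encoding='gbk', pages='all', multiple_tables=True)
for i, df in enumerate(tables):
    print('Table %d:' % i)
    print(df)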

Third, pdfplumber

pdfplumber processes PDFs page by page, gives access to all the text on a page, and provides a dedicated method for extracting tables.

import pdfplumber

path = 'test.pdf'
pdf = pdfplumber.open(path)

for page in pdf.pages:
    # Get all the text on the current page, including text inside tables
    # print(page.extract_text())                        

    for table in page.extract_tables():
        # print(table)
        for row in table:
            print(row)
        print('---------- splitter ----------')

pdf.close()

The extracted table is a two-dimensional list of strings, printed here row by row for comparison with tabula.

Compared with tabula, pdfplumber can first of all tell the two tables apart, and second, its accuracy is much higher: the table header is recognized completely correctly. Cells containing newlines still come out wrong, but at least the column splits are correct, so we can post-process them.

import pdfplumber
import re

path = 'test1.pdf'
pdf = pdfplumber.open(path)

for page in pdf.pages:
    print(page.extract_text())
    for pdf_table in page.extract_tables():
        table = []
        cells = []
        for row in pdf_table:
            if not any(row):
                # The whole row is empty: flush the accumulated cells as one complete record
                if any(cells):
                    table.append(cells)
                    cells = []
            elif all(row):
                # No cell in this row is empty: it is a complete new row, so close out the previous record first
                if any(cells):
                    table.append(cells)
                    cells = []
                table.append(row)
            else:
                if len(cells) == 0:
                    cells = row
                else:
                    for i in range(len(row)):
                        if row[i] is not None:
                            cells[i] = row[i] if cells[i] is None else cells[i] + row[i]
        for row in table:
            print([re.sub(r'\s+', ' ', cell) if cell is not None else None for cell in row])
        print('---------- splitter ----------')

pdf.close()

After this processing, the output is as follows:

This result is exactly right, and with tabula no amount of post-processing would get you there. Of course, different PDFs may need different handling; you have to analyze the actual situation yourself.

pdfplumber has inaccuracies of its own, mainly missing columns:

I found another PDF; a screenshot of its table is as follows:

The parsed results are as follows:

Four columns become two. Merged cells cause this problem too, but I picked this table precisely because it loses columns without having any merged cells. That probably has something to do with how the PDF was generated.

But the data is actually all retrieved; nothing is lost, it just is not recognized as tabular. The output of page.extract_text() is as follows:
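Since extract_text() does return everything, one possible fallback is to split the text lines yourself. A minimal sketch, assuming the columns in this PDF's text are separated by runs of whitespace (cell values that themselves contain spaces would break it):

import re
import pdfplumber

path = 'test.pdf'
pdf = pdfplumber.open(path)

for page in pdf.pages:
    text = page.extract_text() or ''
    for line in text.splitlines():
        # Approximate the columns by splitting each line on runs of whitespace
        print(re.split(r'\s+', line.strip()))

pdf.close()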

Then I tried the same file with tabula, with the following results:

The columns are complete, but where did the table header go?

pdfplumber also provides a visual debugging feature: you can render screenshots of PDF pages with boxes drawn around the recognized text or tables, which helps you judge how the PDF is being parsed and tune the configuration. Using this feature also requires installing ImageMagick. Since I did not need it, I have not looked into it further.
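I did not dig in, but for reference, a minimal sketch of what the visual debugging looks like: to_image renders a page, debug_tablefinder overlays what the table detector found, and draw_rects outlines arbitrary objects such as words.

import pdfplumber

path = 'test.pdf'
pdf = pdfplumber.open(path)
page = pdf.pages[0]

# Render the page to an image (this is the step that needs ImageMagick)
im = page.to_image(resolution=150)
# Overlay the table detector's findings: detected lines, intersections and cells
im.debug_tablefinder()
# Outline every recognized word as well
im.draw_rects(page.extract_words())
im.save('debug.png')

pdf.close()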

Fourth, afterword

When writing crawlers, we inevitably run into PDF parsing, mainly to extract text and table data. Python has far too many PDF libraries, PyPDF2 among them, and plenty of material about them online, but when I tried it the output was garbled. I did not read the source code carefully, so that problem remains unsolved.
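For reference, reading text with PyPDF2 takes only a few lines. This is a minimal sketch using the API PyPDF2 had at the time (PdfFileReader / extractText; newer releases renamed them to PdfReader / extract_text):

from PyPDF2 import PdfFileReader

path = 'test.pdf'
with open(path, 'rb') as f:
    reader = PdfFileReader(f)
    for i in range(reader.getNumPages()):
        # extractText() often returns garbled or empty text,
        # especially for CJK PDFs, which matches what I saw
        print(reader.getPage(i).extractText())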

After comparing these three commonly used libraries, I find pdfplumber the most useful, with the best support for tables.