For many people, converting a PDF to editable text is a necessity, but there is no easy way to do it. In this project, Lucas Soares, a senior machine learning engineer at K1 Digital, automatically transcribes PDF slides using OCR (optical character recognition), with good results.
A traditional lecture is usually accompanied by a set of PDF slides. Generally speaking, to take notes from these lectures, you need to copy and paste a lot of content from the PDF.
Recently, Lucas Soares, a senior machine learning engineer at K1 Digital, has been experimenting with transcribing PDF slides automatically using OCR (optical character recognition), so that their contents can be manipulated directly within markdown files without manually copying and pasting from the PDF.
Project author: Lucas Soares. Project address: Github.com/EnkrateiaLu…
Why not use a traditional PDF to text tool?
Lucas Soares found that traditional tools tend to create additional problems that take time to solve. He tried conventional Python packages but ran into a number of issues (such as having to parse the final output with complex regular expression patterns), so he decided to try object detection and OCR instead.
The basic process can be divided into the following steps:
- Convert PDF to image;
- Detect and recognize text in images;
- Show sample output.
Transcribing PDFs into text with deep-learning-based OCR
Convert PDF to image
Soares uses the PDF slides from David Silver's reinforcement learning lectures (see the slide address below). Each slide is converted to a PNG image using the pdf2image package.
Sample PDF slides: www.davidsilver.uk/wp-content/…
The code is as follows:
from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
pdf_path = "path/to/file/intro_RL_Lecture1.pdf"
images = convert_from_path(pdf_path)
for i, image in enumerate(images):
    fname = "image" + str(i) + ".png"
    image.save(fname, "PNG")
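One caveat with names such as image0.png, image2.png and image10.png: plain string sorting places image10.png before image2.png, so any later step that sorts file names will stop following slide order once there are more than ten slides. A minimal sketch of one way around this is to zero-pad the index. The helper name slide_filename and the default width of 3 are illustrative assumptions, not part of the original project.

```python
def slide_filename(index, width=3, prefix="image", ext="png"):
    """Build a zero-padded file name such as image007.png (hypothetical helper)."""
    return f"{prefix}{index:0{width}d}.{ext}"


# Unpadded names sort as strings, not as numbers.
unpadded = sorted(f"image{i}.png" for i in (0, 2, 10))
# Padded names keep slide order under plain string sorting.
padded = sorted(slide_filename(i) for i in (0, 2, 10))

print(unpadded)
print(padded)
```

If this naming were adopted, later references to a specific slide would change accordingly, for example image7.png would become image007.png.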
After processing, all PDF slides are converted to PNG images.
Detect and recognize text in images
To detect and recognize text in the PNG images, Soares uses the text detector from the ocr.pytorch library. Follow the instructions in its repository to download the models and save them in the checkpoints folder.
ocr.pytorch library address: github.com/courao/ocr…
The code is as follows:
# adapted from this source: https://github.com/courao/ocr.pytorch
%load_ext autoreload
%autoreload 2
import os
from ocr import ocr
import time
import shutil
import numpy as np
import pathlib
from PIL import Image
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pytesseract
def single_pic_proc(image_file):
    image = np.array(Image.open(image_file).convert('RGB'))
    result, image_framed = ocr(image)
    return result,image_framed
image_files = glob('./input_images/*.*')
result_dir = './output_images_with_boxes/'
# If the output folder exists we will remove it and redo it.
if os.path.exists(result_dir):
    shutil.rmtree(result_dir)
os.mkdir(result_dir)
for image_file in sorted(image_files):
    result, image_framed = single_pic_proc(image_file) # detecting and recognizing the text
    filename = pathlib.Path(image_file).name
    output_file = os.path.join(result_dir, image_file.split('/')[-1])
    txt_file = os.path.join(result_dir, image_file.split('/')[-1].split('.')[0]+'.txt')
    txt_f = open(txt_file, 'w')
    Image.fromarray(image_framed).save(output_file)
    for key in result:
        txt_f.write(result[key][1]+'\n')
    txt_f.close()
The code sets up the input and output folders, walks through all the input images (the converted PDF slides), runs the detection and recognition models in the ocr module through the single_pic_proc() function, and finally saves the results to the output folder.
Text detection is inherited from a PyTorch implementation of CTPN and text recognition from a PyTorch implementation of CRNN, both of which live in the ocr module.
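The transcription loop above only reads result[key][1], the recognized string for each detected region. A minimal sketch of collecting those strings into clean lines follows. It assumes result is a dict whose values are sequences with the text at index 1, which is all the original code relies on. The helper name result_to_lines and the sample data are made up for illustration.

```python
def result_to_lines(result):
    """Collect recognized strings from an OCR result dict (hypothetical helper).

    Assumes each value is a sequence whose element at index 1 is the text,
    mirroring the result[key][1] access in the transcription loop.
    """
    lines = []
    for value in result.values():  # keep the dict's own order, as the loop does
        text = value[1].strip()
        if text:  # skip regions where nothing was recognized
            lines.append(text)
    return lines


# Made-up sample shaped like the assumed output; the boxes are placeholders.
sample = {
    0: [None, "Lecture 1"],
    1: [None, "  Agent and Environment "],
    2: [None, ""],
}
print(result_to_lines(sample))
```

Stripping whitespace and dropping empty strings keeps the saved text files free of blank lines, which makes later pasting into notes a little cleaner.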
Sample output
The code is as follows:
import cv2 as cv
output_dir = pathlib.Path("./output_images_with_boxes")
# image = cv.imread(str(np.random.choice(list(output_dir.iterdir()),1)[0]))
image = cv.imread(f"{output_dir}/image7.png")
size_reshaped = (int(image.shape[1]),int(image.shape[0]))
image = cv.resize(image, size_reshaped)
cv.imshow("image", image)
cv.waitKey(0)
cv.destroyAllWindows()
The original PDF slide is shown on the left and the transcribed output text on the right. The transcription is highly accurate.
The text recognition output is as follows:
filename = f"{output_dir}/image7.txt"
with open(filename, "r") as text:
    for line in text.readlines():
        print(line.strip("\n"))
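Since the stated goal is to work with slide content inside markdown files, the per-slide text files can be merged into a single markdown document. The following is a minimal sketch, not the author's implementation: the function names, the "## Slide N" heading format, and the demo file contents are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def slide_number(path):
    """Return the first integer in the file name, or -1 if there is none."""
    match = re.search(r"\d+", path.stem)
    return int(match.group()) if match else -1


def build_markdown(txt_dir):
    """Merge per-slide .txt files into one markdown string (illustrative sketch)."""
    sections = []
    # Sort numerically so that image10.txt comes after image2.txt.
    for path in sorted(Path(txt_dir).glob("*.txt"), key=slide_number):
        body = path.read_text(encoding="utf-8").strip()
        sections.append(f"## Slide {slide_number(path)}\n\n{body}\n")
    return "\n".join(sections)


# Demo with made-up text files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for i, text in [(10, "Rewards"), (2, "About RL")]:
        Path(tmp, f"image{i}.txt").write_text(text, encoding="utf-8")
    notes = build_markdown(tmp)

print(notes)
```

In practice, build_markdown would be pointed at the folder holding the transcribed .txt files and the returned string written to a .md file.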
By doing this, you end up with a very powerful tool for transcribing all kinds of documents, from detecting and recognizing handwritten notes to detecting and recognizing random text in photos. It is better to have your own OCR tool to handle some text than to rely on external software to transcribe documents.
Original article: towardsdatascience.com/faster-note…