OCR Tesseract-OCR


Hardware and software environment

Windows 10 64-bit
Anaconda with Python 3.7
NVIDIA GTX 1066
OpenCV 4.4.0
Tesseract 5.0.0 alpha

Introduction

Tesseract is an OCR (Optical Character Recognition) engine that was first developed at HP Labs in 1985 and later transferred to Google for further development. The project is now hosted on GitHub and has supported Chinese character recognition since version 3.0. It has now reached version 5.0 and runs on multiple operating systems. This article covers the basic installation of Tesseract-OCR, its command-line use, and how to call it from Python.

Tesseract-OCR installation

Official download addresses:

tesseract-ocr.github.io/tessdoc/Dow…
digi.bib.uni-mannheim.de/tesseract/t…
digi.bib.uni-mannheim.de/tesseract/t…

After downloading, run the installer. When selecting components, also select the Chinese language pack, because we will be doing Chinese character recognition.

Next, set two system environment variables. First, add the Tesseract-OCR installation directory to PATH. The default installation path is C:\Program Files\Tesseract-OCR

Then create a new environment variable TESSDATA_PREFIX with the value C:\Program Files\Tesseract-OCR\tessdata

If you installed to a custom path, adjust both values accordingly.
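The two settings above can be checked from Python before going any further. The following is only a minimal sketch: the helper name check_tesseract_env is made up for this article, and it merely reports whether a tesseract executable is found on PATH and whether TESSDATA_PREFIX is set. It does not validate the contents of either directory.

```python
import os
import shutil


def check_tesseract_env(environ=None, which=shutil.which):
    """Report whether Tesseract looks reachable (hypothetical helper).

    `environ` and `which` are injectable so the check can be exercised
    without a real installation.
    """
    environ = os.environ if environ is None else environ
    exe = which("tesseract")  # None when the executable is not on PATH
    tessdata = environ.get("TESSDATA_PREFIX")
    return {
        "executable": exe,
        "on_path": exe is not None,
        "tessdata_prefix": tessdata,
        "tessdata_set": bool(tessdata),
    }


if __name__ == "__main__":
    for key, value in check_tesseract_env().items():
        print(f"{key}: {value}")
```

If on_path is False, open a new terminal first: on Windows, a console that was already open does not pick up changes to the system environment variables.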

Verify the environment

Run the tesseract -v command to view the version number.

Run tesseract --list-langs to see which languages are available for recognition.

(base) PS C:\Users\Administrator> tesseract --list-langs
List of available languages (4):
chi_sim
chi_sim_vert
eng
osd
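The same check can be scripted, for example to confirm that chi_sim is installed before running recognition. This is a rough sketch, not part of Tesseract itself: parse_langs and installed_langs are hypothetical helper names, and the code assumes the list is printed to standard output in the format shown above.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. "List of available languages (4):"
    return [
        line for line in lines
        if not line.startswith("List of available languages")
    ]


def installed_langs():
    """Run the CLI and return the installed language codes.

    Assumes `tesseract` is on PATH and writes the list to stdout.
    """
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_langs(result.stdout)
```

A caller could then test "chi_sim" in installed_langs() and stop early with a clear message if the Chinese package was not selected during installation.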

Testing

Find an image that contains Chinese text and run Tesseract on it

tesseract test.png result -l chi_sim

The recognition results are saved in the file result.txt
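The command above can also be driven from a script, without any extra library. The sketch below is one possible way to do it: build_command and run_ocr are hypothetical names, and the file names test.png and result are simply the ones used in the command above. It assumes the output file is UTF-8 text.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, lang="chi_sim"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base="result", lang="chi_sim"):
    """Run Tesseract and return the recognized text.

    Assumes `tesseract` is on PATH. For plain-text output Tesseract
    appends ".txt" to the output base name.
    """
    subprocess.run(build_command(image, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Passing the arguments as a list rather than a single string avoids quoting problems when the image path contains spaces, which is common under C:\Users.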

Using Tesseract in Python

This requires a third-party library, pytesseract, which must be installed first

pip install pytesseract

Let’s look at some sample code

import sys

import cv2
import pytesseract

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ocr_demo.py image.jpg')
        sys.exit(1)

    # Take the image path from the command-line argument
    imPath = sys.argv[1]

    # -l chi_sim selects Simplified Chinese
    # --oem 1 selects the LSTM OCR engine; possible values are 0, 1, 2, 3:
    #   0 Legacy engine only.
    #   1 Neural nets LSTM engine only.
    #   2 Legacy + LSTM engines.
    #   3 Default, based on what is available.
    # --psm 3 sets the page segmentation mode to fully automatic
    config = '-l chi_sim --oem 1 --psm 3'

    # Read the image
    im = cv2.imread(imPath, cv2.IMREAD_COLOR)

    # Run recognition and print the text
    text = pytesseract.image_to_string(im, config=config)
    print(text)

Running the code with the test image above gives you

PS C:\xugaoxiang\gogs\Learnopencv\OCR> python .\ocr_demo.py C:\Users\Administrator\Desktop\test.png
https://xugaoxiang.com
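The config string in the script is easy to mistype, because every option must be written without spaces inside it (--oem, not -- OEM). As an illustration only, a small helper can assemble and validate it. build_config is a made-up name, and the accepted ranges (0 to 3 for --oem, 0 to 13 for --psm) are the values listed by tesseract --help-oem and tesseract --help-psm for recent versions, so check them against your own installation.

```python
def build_config(lang="chi_sim", oem=1, psm=3):
    """Assemble a Tesseract config string such as '-l chi_sim --oem 1 --psm 3'.

    Hypothetical helper; the ranges below are assumptions based on the
    modes listed by recent Tesseract versions.
    """
    if oem not in range(4):  # valid engine modes: 0, 1, 2, 3
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in range(14):  # valid page segmentation modes: 0 to 13
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"-l {lang} --oem {oem} --psm {psm}"
```

The result can be passed directly as the config argument of pytesseract.image_to_string, exactly like the hand-written string in the script above.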

References

github.com/tesseract-o…
github.com/madmaze/pyt…
