This is day 16 of my participation in the August More Text Challenge.
Hardware and software environment
- Windows 10 64-bit
- Anaconda with Python 3.7
- NVIDIA GTX 1066
- OpenCV 4.4.0
- Tesseract 5.0.0 alpha
Introduction
Tesseract is an OCR (Optical Character Recognition) engine that was first developed at HP Labs in 1985 and later handed over to Google for further development. The project is now hosted on GitHub and has supported Chinese character recognition since version 3.0. It has now reached version 5.0 and runs on multiple operating systems. This article covers the basic installation of tesseract-ocr, how to use it from the command line, and how to call it from Python.
Tesseract-OCR installation
The official download page: tesseract-ocr.github.io/tessdoc/Dow… Installer downloads: digi.bib.uni-mannheim.de/tesseract/t… digi.bib.uni-mannheim.de/tesseract/t…
After downloading, run the installer. When choosing components, select the Chinese language pack as well, because we want to recognize Chinese characters.
Next, set two system environment variables. First, add the Tesseract-OCR installation directory to PATH. The default installation path is C:\Program Files\Tesseract-OCR
Then create a new environment variable TESSDATA_PREFIX with the value C:\Program Files\Tesseract-OCR\tessdata
If you customized the installation path, adjust both values accordingly.
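Both settings can also be checked from Python before going further. The following is a minimal sketch, not part of the original setup: the helper name check_tesseract_env is hypothetical, and it only reports whether a tesseract executable can be found on PATH and what TESSDATA_PREFIX is set to.

```python
import os
import shutil


def check_tesseract_env(env=None):
    """Report whether the two settings described above look correct.

    `env` is a mapping of environment variables; it defaults to os.environ.
    This helper is hypothetical and only inspects the environment.
    """
    env = os.environ if env is None else env
    # shutil.which searches the given PATH string for the executable
    exe = shutil.which("tesseract", path=env.get("PATH", ""))
    return {
        "tesseract_on_path": exe is not None,
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    print(check_tesseract_env())
```

If tesseract_on_path is False or tessdata_prefix is None, open a new terminal after changing the environment variables and try again, since already running shells keep the old values.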
Verify the environment
Run the tesseract -v command to view the version number.
Use tesseract --list-langs to see which languages are available for recognition:
```
(base) PS C:\Users\Administrator> tesseract --list-langs
List of available languages (4):
chi_sim
chi_sim_vert
eng
osd
```
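The same check can be scripted. The sketch below is an illustration under assumptions: the function names parse_list_langs and installed_languages are hypothetical, and the parser assumes the output format shown above (one header line followed by one language code per line).

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is a header such as "List of available languages (4):"
    if lines and lines[0].lower().startswith("list of available languages"):
        lines = lines[1:]
    return lines


def installed_languages():
    """Run tesseract and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some versions write the list to stderr, so both streams are parsed
    return parse_list_langs(result.stdout + result.stderr)


sample = """List of available languages (4):
chi_sim
chi_sim_vert
eng
osd
"""
langs = parse_list_langs(sample)
print("chi_sim" in langs)  # True
```

If chi_sim is missing from the list, the Chinese language pack was not selected during installation.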
Test results
Find a picture that contains Chinese text and run Tesseract on it:
```
tesseract test.png result -l chi_sim
```
The recognition results are saved in the file result.txt.
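The command above can also be driven from a script with the standard library alone. This is a rough sketch: build_command and ocr_to_text are hypothetical helper names, and it assumes tesseract is on PATH and that the output file is UTF-8 text named after the output base with .txt appended, as in the example above.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, lang="chi_sim"):
    """Build the same command line as above as an argument list."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_to_text(image, output_base="result", lang="chi_sim"):
    """Run tesseract and return the contents of <output_base>.txt."""
    subprocess.run(build_command(image, output_base, lang), check=True)
    # tesseract appends .txt to the output base name
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling ocr_to_text("test.png") would run the same recognition as the command line and return the text instead of leaving it only on disk.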
Using Tesseract in Python
This requires a third-party library, pytesseract. Install it first:
```
pip install pytesseract
```
Let's look at some sample code:
```python
import sys

import cv2
import pytesseract

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ocr_demo.py image.jpg')
        sys.exit(1)

    # Read the image path from the command line
    imPath = sys.argv[1]

    # -l chi_sim  recognize Simplified Chinese
    # --oem 1     use LSTM as the OCR engine; possible values are 0, 1, 2, 3:
    #   0  Legacy engine only.
    #   1  Neural nets LSTM engine only.
    #   2  Legacy + LSTM engines.
    #   3  Default, based on what is available.
    # --psm 3     set the page segmentation mode to automatic
    config = ('-l chi_sim --oem 1 --psm 3')

    # Read the image
    im = cv2.imread(imPath, cv2.IMREAD_COLOR)

    # Run OCR on the image
    text = pytesseract.image_to_string(im, config=config)

    # Print the recognized text
    print(text)
```
Running the code on a test image gives:
```
PS C:\xugaoxiang\gogs\learnopencv\OCR> python .\ocr_demo.py C:\Users\Administrator\Desktop\test.png
https://xugaoxiang.com
```
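The configuration string in the sample is easy to mistype, so it can help to build it from validated values. The following is a sketch under assumptions rather than the author's code: build_config and ocr_image are hypothetical names, the accepted PSM range of 0 to 13 is assumed for Tesseract 4 and later, and the language is passed through the lang parameter of pytesseract.image_to_string instead of the -l flag.

```python
def build_config(oem=1, psm=3):
    """Build a pytesseract config string from an OEM and a PSM value."""
    if oem not in range(4):
        raise ValueError("oem must be 0, 1, 2 or 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


def ocr_image(path, lang="chi_sim", oem=1, psm=3):
    """Read an image with OpenCV and return the text pytesseract recognizes."""
    # Imported here so build_config can be used without OpenCV installed
    import cv2
    import pytesseract

    im = cv2.imread(path, cv2.IMREAD_COLOR)
    if im is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError(path)
    # OpenCV loads pixels in BGR order; pytesseract assumes RGB
    rgb = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang=lang, config=build_config(oem, psm))
```

The explicit None check is there because a wrong path otherwise only surfaces later as an error inside pytesseract, which is harder to diagnose.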
References
- github.com/tesseract-o…
- github.com/madmaze/pyt…