This is the 16th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge

Hardware and software environment

  • Windows 10 64-bit
  • Anaconda with Python 3.7
  • NVIDIA GTX 1066
  • OpenCV 4.4.0
  • Tesseract 5.0.0 alpha

Introduction

Tesseract is an OCR (Optical Character Recognition) engine that was first developed by HP Labs in 1985 and later handed over to Google for further development. The project is now hosted on GitHub and has supported Chinese character recognition since version 3.0. It has now reached version 5.0 and runs on multiple operating systems. This article covers the basic installation of tesseract-ocr, how to use it from the command line, and how to call it from Python.

tesseract-ocr installation

The official download page: tesseract-ocr.github.io/tessdoc/Dow… Windows installers: digi.bib.uni-mannheim.de/tesseract/t… digi.bib.uni-mannheim.de/tesseract/t…

After downloading, run the installer. When choosing components, also select the Chinese language packs, because we need to recognize Chinese characters.

Next, set two system environment variables. First, add the tesseract-ocr installation directory to PATH. The default installation path is C:\Program Files\Tesseract-OCR

Then create a new environment variable TESSDATA_PREFIX with the value C:\Program Files\Tesseract-OCR\tessdata

If you chose a custom installation path, adjust both values accordingly.
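To double-check the two settings from Python, a small sketch along these lines could help. The helper name `check_tesseract_env` is hypothetical (it is not part of Tesseract), and it accepts the environment as a plain dictionary so it can be tried without touching the real system settings.

```python
import os
import shutil


def check_tesseract_env(env=None):
    """Report whether PATH and TESSDATA_PREFIX look usable.

    check_tesseract_env is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    return {
        # shutil.which searches the given PATH string for the executable
        "tesseract_on_path": shutil.which("tesseract", path=env.get("PATH", "")) is not None,
        # The variable only needs to exist here; its value is checked next
        "tessdata_prefix_set": prefix is not None,
        # The directory named by TESSDATA_PREFIX should exist on disk
        "tessdata_dir_exists": bool(prefix) and os.path.isdir(prefix),
    }


if __name__ == "__main__":
    for name, ok in check_tesseract_env().items():
        print(name, "OK" if ok else "MISSING")
```

Open a new terminal after changing environment variables, because terminals that are already open keep the old values.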

Verify the environment

Run the tesseract -v command to view the version number.

Use tesseract --list-langs to see which languages are available for recognition:

(base) PS C:\Users\Administrator> tesseract --list-langs
List of available languages (4):
chi_sim
chi_sim_vert
eng
osd
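If a script needs to confirm that the Chinese model is installed before running OCR, the listing can be parsed with a few lines of Python. This is only a sketch: `parse_langs` is a hypothetical helper, and the sample text is the output shown above.

```python
def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    parse_langs is a hypothetical helper for illustration only.
    """
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages (N):" header
    return [line for line in lines if line and not line.startswith("List of")]


# Sample output copied from the listing in this article
sample = """List of available languages (4):
chi_sim
chi_sim_vert
eng
osd
"""

langs = parse_langs(sample)
print("chi_sim available:", "chi_sim" in langs)
```

In a real script, the text passed to `parse_langs` would come from running the command, for example with `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)`.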

Test results

Find an image that contains Chinese text and test it:

tesseract test.png result -l chi_sim

The recognition result is saved in the file result.txt.

Using it in Python
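The same command can also be driven from a script without any extra library. The sketch below assumes tesseract is on PATH; `build_tesseract_command` and `run_ocr` are hypothetical helper names, and the file names are the ones used in the command above.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="chi_sim"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="chi_sim"):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends .txt to the output base name
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


print(build_tesseract_command("test.png", "result"))
```

Calling `run_ocr("test.png", "result")` would then return the contents of result.txt as a string.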

This requires a third-party library, pytesseract, so install it first:

pip install pytesseract

Let’s look at some sample code

import cv2
import sys
import pytesseract

if __name__ == '__main__':

    if len(sys.argv) < 2:
        print('Usage: python ocr_demo.py image.jpg')
        sys.exit(1)

    # Read the image path from the command-line argument
    imPath = sys.argv[1]

    # -l chi_sim selects the Simplified Chinese model
    # --oem selects the OCR engine mode (1 uses LSTM); possible values are 0, 1, 2, 3:
    # 0    Legacy engine only.
    # 1    Neural nets LSTM engine only.
    # 2    Legacy + LSTM engines.
    # 3    Default, based on what is available.
    # --psm 3 sets the page segmentation mode to fully automatic
    config = ('-l chi_sim --oem 1 --psm 3')

    # Read the image as a color image
    im = cv2.imread(imPath, cv2.IMREAD_COLOR)

    # Run OCR on the image
    text = pytesseract.image_to_string(im, config=config)

    # Print the recognized text
    print(text)
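Because the config string is easy to mistype, one option is to build it with a small function that validates the values first. The sketch below is an illustration only: `build_config` is a hypothetical helper, and the accepted ranges (0 to 3 for --oem, 0 to 13 for --psm) are the ones listed by `tesseract --help-extra` in recent versions.

```python
def build_config(lang="chi_sim", oem=1, psm=3):
    """Build a Tesseract config string such as '-l chi_sim --oem 1 --psm 3'.

    build_config is a hypothetical helper, not part of pytesseract.
    """
    if oem not in range(4):
        raise ValueError("oem must be 0, 1, 2 or 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    return "-l {} --oem {} --psm {}".format(lang, oem, psm)


print(build_config())
```

The result can be passed as `config=build_config()` to `pytesseract.image_to_string`. That function also accepts a separate `lang` argument, so the language does not have to be part of the config string.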

Running the code with the test image above gives:

PS C:\xugaoxiang\gogs\learnopencv\OCR> python .\ocr_demo.py C:\Users\Administrator\Desktop\test.png
https://xugaoxiang.com

References

  • Github.com/tesseract-o…
  • Github.com/madmaze/pyt…