Preface

Tesseract-OCR is a free, open-source OCR (optical character recognition) engine. It was originally developed at HP Labs and later released as open source, after which Google improved it, fixed bugs, optimized it, and re-released it. Run from the command line, it converts the text in an image into machine-readable text. It currently supports more than 60 languages, including Simplified Chinese, Traditional Chinese, English, Japanese, and Korean.

Environment preparation

1. Install Tesseract 4.1

  • digi.bib.uni-mannheim.de/tesseract/

    Go to the download page and select Tesseract-Ocr-w64-setup-v4.1.0.201903.exe

  • During installation, click "Next" at each step. The installation path can be customized, and the additional language data does not need to be selected, because the best models are downloaded from GitHub in the next step.

2. Download tessdata_best

What is tessdata_best?

tessdata_best is the set of most accurate trained models for Tesseract's LSTM engine.

  • Download: github.com/tesseract-o…

  • Unzip the archive and copy the following four files into the Tesseract-OCR\tessdata directory

    • chi_sim.traineddata
    • chi_sim_vert.traineddata
    • chi_tra.traineddata
    • chi_tra_vert.traineddata

3. Install pytesseract

To use Tesseract from Python code, install pytesseract with pip (pip install pytesseract)

  • The test code is as follows:

    from PIL import Image
    import pytesseract

    # Recognize Simplified Chinese text in the image and print the result
    text = pytesseract.image_to_string(Image.open(r'D:\train\1.jpg'), lang='chi_sim')
    print(text)
  • An error you may encounter when using pytesseract

     raise TesseractNotFoundError()
    pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

    This means that Tesseract is either not installed or not on the PATH.

    After pytesseract is installed, a pytesseract folder is created in the site-packages directory under Python's Lib folder. Open pytesseract.py in that folder and find the following line:

    tesseract_cmd = 'tesseract'

    Change tesseract_cmd = 'tesseract' to tesseract_cmd = r'D:\Tesseract-OCR\tesseract.exe'.

    tesseract_cmd must be the absolute path of the installed tesseract executable so that pytesseract can find it. The r prefix keeps Python from interpreting \t in the path as a tab character.
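    As an alternative to editing pytesseract.py inside site-packages, the same setting can be applied at runtime through the module attribute pytesseract.pytesseract.tesseract_cmd. The following is a minimal sketch under stated assumptions: the helper names find_tesseract and configure_pytesseract are hypothetical, and the candidate path in the usage note is only an example install location.

```python
import os
import shutil


def find_tesseract(candidates):
    """Return the first usable tesseract executable path, or None.

    The PATH is tried first, then each explicit candidate path in order.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract(candidates):
    """Point pytesseract at the executable located by find_tesseract."""
    import pytesseract  # imported here so find_tesseract works without it

    path = find_tesseract(candidates)
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    # Same effect as editing tesseract_cmd in pytesseract.py, but per script.
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

    A script would call configure_pytesseract([r'D:\Tesseract-OCR\tesseract.exe']) once before its first image_to_string call. This survives reinstalling or upgrading pytesseract, which would overwrite a manual edit.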

4. Configure environment variables

Step 1: Add the Tesseract installation directory to the Path system environment variable

Step 2: Add a system variable for tessdata (the trained language data files)

The variable name is TESSDATA_PREFIX

The variable value is the path to the tessdata directory

After the configuration is complete, open a new command prompt and enter tesseract -v. If the version information is displayed, the environment variables are configured correctly.
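The TESSDATA_PREFIX setting can also be checked from Python. This is a small sketch, not an official tool: the function name check_tessdata_prefix is hypothetical, and it only verifies that the variable points to a directory containing at least one .traineddata file.

```python
import os
from pathlib import Path


def check_tessdata_prefix(env=os.environ):
    """Return a list of problems found with TESSDATA_PREFIX (empty if none)."""
    value = env.get("TESSDATA_PREFIX")
    if not value:
        return ["TESSDATA_PREFIX is not set"]
    tessdata = Path(value)
    if not tessdata.is_dir():
        return [f"TESSDATA_PREFIX is not a directory: {value}"]
    # The directory should hold files such as chi_sim.traineddata.
    if not any(tessdata.glob("*.traineddata")):
        return [f"no .traineddata files in {value}"]
    return []
```

Calling check_tessdata_prefix() with no arguments inspects the current process environment; an empty list means the variable looks usable.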

Basic commands:

tesseract --version displays the Tesseract version

tesseract --list-langs lists the language data currently available to Tesseract
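To confirm from a script that the four Chinese models were picked up, the output of tesseract --list-langs can be parsed. The sketch below assumes the output is a header line ending in a colon followed by one language code per line, which is what Tesseract 4.1 prints; the function names are hypothetical.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def missing_languages(required):
    """Run tesseract and return the required codes that are not installed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    installed = set(parse_list_langs(result.stdout))
    return [lang for lang in required if lang not in installed]
```

For example, missing_languages(["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"]) should return an empty list once the files from step 2 are in place.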

5. Download jTessBoxEditor

Why download jTessBoxEditor?

The training workflow below uses jTessBoxEditor to merge sample images into a TIF file and to correct box files. The tool requires a Java runtime (JRE), which is easy to install and configure.

Download address: sourceforge.net/projects/vi…

Go to the download page and select "jTessBoxEditor-2.1.1.zip"

After unzipping, double-click jTessBoxEditor.jar to launch it.

Simple conversion

1. Conversion process

Prepare an image file, such as 1.jpg (several image formats are supported)

In the command line, change to the directory that contains the image, for example D:\train, and then run:

tesseract 1.jpg test -l chi_sim --psm 7

The command writes the recognized text to test.txt in the same directory. Open that file to view the output.

2. Command interpretation

tesseract imagename outputbase -l lang --psm pagesegmode

  • -l chi_sim specifies that the Simplified Chinese language data is used. (You need to download the Chinese language data file and save it in the tessdata directory. Language data files have the extension .traineddata; the Simplified Chinese file is named chi_sim.traineddata.)

  • --psm 7 tells Tesseract to treat the image as a single line of text. Choosing the mode that matches the image layout reduces the recognition error rate. The default is 3.

    Page segmentation modes (--psm):
    0 Orientation and script detection (OSD) only.
    1 Automatic page segmentation with OSD.
    2 Automatic page segmentation, but no OSD or OCR. (not implemented)
    3 Fully automatic page segmentation, but no OSD. (Default)
    4 Assume a single column of text of variable sizes.
    5 Assume a single uniform block of vertically aligned text.
    6 Assume a single uniform block of text.
    7 Treat the image as a single text line.
    8 Treat the image as a single word.
    9 Treat the image as a single word in a circle.
    10 Treat the image as a single character.
    11 Sparse text. Find as much text as possible in no particular order.
    12 Sparse text with OSD.
    13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
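The command format above can be assembled from Python as an argument list, which also guards against an out-of-range mode. This is a sketch only; build_ocr_command is a hypothetical helper name, and the defaults mirror the example in this section.

```python
def build_ocr_command(image, output_base, lang="chi_sim", psm=3):
    """Build the argument list for: tesseract image output_base -l lang --psm N."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "--psm", str(psm),
    ]
```

Running subprocess.run(build_ocr_command("1.jpg", "test", psm=7), check=True) in D:\train is equivalent to the command shown earlier. With pytesseract, the same mode can be passed as image_to_string(image, lang='chi_sim', config='--psm 7').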

LSTM training process

Introduction

Tesseract 4 includes a new neural-network-based (LSTM) recognition engine that is significantly more accurate on document images than previous versions. However, the stock Chinese language data chi_sim has low accuracy on handwritten Chinese and on images with complex backgrounds, so training on your own samples is needed to improve the recognition rate. Training also lets you build your own language data file.

Training process

1. Generate a TIF file

  • Use jTessBoxEditor to merge multiple sample images into a single TIF file

For the subsequent steps to work, name the merged file using the pattern [lang].[fontname].exp[num].tif

  • lang is the language name
  • fontname is the font name
  • num is the serial number

Here the TIF file is saved as nml.num.exp3.tif
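The naming pattern can be checked with a short regular expression before running the later steps. This is an illustrative sketch; NAME_PATTERN and both function names are hypothetical, and the pattern assumes that lang and fontname contain no dots.

```python
import re

# [lang].[fontname].exp[num].tif
NAME_PATTERN = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def parse_training_name(filename):
    """Return (lang, fontname, num) or None if the name does not match."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    return match["lang"], match["fontname"], int(match["num"])


def make_training_name(lang, fontname, num):
    """Build a file name that follows the training naming pattern."""
    return f"{lang}.{fontname}.exp{num}.tif"
```

For the file used in the commands below, parse_training_name("nml.num.exp3.tif") yields the language nml, the font name num, and the serial number 3.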

2. Generate the box file for the TIF image

tesseract nml.num.exp3.tif nml.num.exp3 -l chi_sim batch.nochop makebox

Running the above command generates a file named nml.num.exp3.box in the folder.

3. Add and modify box files

Open jTessBoxEditor, click the Box Editor tab, and load the TIF file to correct the boxes (the box file and the TIF file must be in the same folder).
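Before or after editing, the box file can be sanity-checked in code. The sketch below assumes the character-level layout written by makebox, where each line is "symbol left bottom right top page" with coordinates measured from the bottom-left corner of the image; the Box type and function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line into a Box."""
    # Split from the right so the symbol field is kept whole.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def suspicious_boxes(lines):
    """Return boxes with zero or negative width or height."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]
```

Box files are UTF-8 text, so a file would be read with open("nml.num.exp3.box", encoding="utf-8") and its lines passed to suspicious_boxes. Any box it returns is worth inspecting in the Box Editor.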

4. Generate an LSTMF file

tesseract nml.num.exp3.tif nml.num.exp3 -l chi_sim --psm 6 lstm.train

Running the above command generates a file named nml.num.exp3.lstmf in the folder.
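Because this step relies on the corrected box file sitting next to the TIF file under the same base name, it can help to check the pairing first when there are several training images. A minimal sketch, with a hypothetical function name:

```python
from pathlib import Path


def tifs_without_box(folder):
    """Return the names of .tif files that have no matching .box file."""
    folder = Path(folder)
    return sorted(
        tif.name
        for tif in folder.glob("*.tif")
        # nml.num.exp3.tif is expected to pair with nml.num.exp3.box
        if not tif.with_suffix(".box").exists()
    )
```

An empty result means every TIF file in the training folder has a box file with the same base name.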

5. Extract the LSTM file of the language

Copy the .traineddata file downloaded from tessdata_best (step 2 of environment preparation) into the training folder, then extract the .lstm file from it:

combine_tessdata -e chi_sim.traineddata chisim.lstm

Running the above command will generate a file named chisim.lstm in our folder.

6. Start LSTM training

Note: Create a new text file named chitraing.txt and fill it with the absolute path of each .lstmf file generated in step 4, one path per line.

lstmtraining --model_output="D:\Program Files (x86)\train1\output" --continue_from="D:\Program Files (x86)\train1\chisim.lstm" --train_listfile="D:\Program Files (x86)\train1\chitraing.txt" --traineddata="D:\Program Files (x86)\train1\chi_sim.traineddata"
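The list file described in the note can be generated rather than typed by hand, which avoids path typos when there are several .lstmf files. This is a sketch; write_train_listfile is a hypothetical helper, and the default file name simply matches the one used in the command above.

```python
from pathlib import Path


def write_train_listfile(folder, listfile_name="chitraing.txt"):
    """Write the absolute path of every .lstmf file in folder, one per line."""
    folder = Path(folder).resolve()
    lstmf_paths = sorted(folder.glob("*.lstmf"))
    listfile = folder / listfile_name
    listfile.write_text(
        "".join(f"{path}\n" for path in lstmf_paths), encoding="utf-8"
    )
    return listfile
```

Calling write_train_listfile(r"D:\Program Files (x86)\train1") would produce the chitraing.txt file that --train_listfile points to.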

7. Build the new traineddata file

lstmtraining --stop_training --continue_from="D:\Program Files (x86)\train1\output_checkpoint" --traineddata="D:\Program Files (x86)\train1\chi_sim.traineddata" --model_output="D:\Program Files (x86)\train1\test.traineddata"
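Because the training folder path contains spaces, quoting is easy to get wrong on the command line. One way around this is to build the two lstmtraining calls from steps 6 and 7 as argument lists in Python. The sketch below uses only the flags shown above; the function name and its parameters are hypothetical, and the file names follow the ones used in this walkthrough.

```python
from pathlib import Path


def training_commands(train_dir, base_lang="chi_sim", new_lang="test"):
    """Return (train, finalize) argument lists for the two lstmtraining calls."""
    folder = Path(train_dir)
    traineddata = folder / f"{base_lang}.traineddata"
    output_prefix = folder / "output"
    checkpoint = folder / "output_checkpoint"
    new_model = folder / f"{new_lang}.traineddata"

    # Step 6: continue training from the extracted .lstm file.
    train = [
        "lstmtraining",
        f"--model_output={output_prefix}",
        f"--continue_from={folder / 'chisim.lstm'}",
        f"--train_listfile={folder / 'chitraing.txt'}",
        f"--traineddata={traineddata}",
    ]
    # Step 7: convert the checkpoint into a .traineddata file.
    finalize = [
        "lstmtraining",
        "--stop_training",
        f"--continue_from={checkpoint}",
        f"--traineddata={traineddata}",
        f"--model_output={new_model}",
    ]
    return train, finalize
```

Each list can be passed to subprocess.run(command, check=True). Since the arguments are not re-parsed by a shell, no extra quotes are needed around paths with spaces.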

8. Move the new traineddata file

A file named test.traineddata is generated in the training folder. Copy it to the tessdata folder of Tesseract-OCR, after which it can be used as a language for recognition (for example, with -l test).
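The copy step can be scripted as well. In this sketch, install_traineddata is a hypothetical helper that copies the file and returns the language code to pass to -l or to the lang argument of pytesseract.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, tessdata_dir):
    """Copy a .traineddata file into tessdata and return its language code."""
    source = Path(traineddata_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {source}")
    shutil.copy2(source, Path(tessdata_dir) / source.name)
    # test.traineddata is selected with the language code "test"
    return source.stem
```

The returned code can then be used just like chi_sim in the earlier test code, for example pytesseract.image_to_string(Image.open(r'D:\train\1.jpg'), lang='test').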