Train the Tesseract model with jTessBoxEditor

Introduction:

This article explains how to train Tesseract to recognize handwritten digits. After reading it, you can apply the same approach to other scripts such as Chinese, other characters, or even symbols; the method is general.

Prerequisites:

1. jTessBoxEditor

2. Install Tesseract

3. An image that is not recognized correctly (here I wrote 12345 by hand, and it was recognized as 1365 using the default eng language data)

As follows:

Results:

tesseract 12345.png stdout
1365

Training steps:

1. Open jTessBoxEditor by entering the following on the command line

java -jar jTessBoxEditor.jar

An interactive window pops up

2. Choose Tools->Merge TIFF

3. Select the images to merge and name the merged file in the format [lang].[fontname].exp[num].tif, where lang is the language name, fontname is the font name, and num is a custom number.

Here the file is named num.test.exp0.tif. This name will be used repeatedly in subsequent steps.

When done, jTessBoxEditor displays a confirmation popup and generates the num.test.exp0.tif file in the current train directory.
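Because the same base name recurs in every later command, it can help to build it once. The following is a minimal sketch; training_basename is a hypothetical helper introduced here for illustration, not part of Tesseract or jTessBoxEditor.

```shell
#!/bin/sh
# training_basename is a hypothetical helper (not part of Tesseract).
# It builds the base name [lang].[fontname].exp[num] used throughout training.
training_basename() {
    lang=$1
    fontname=$2
    num=$3
    printf '%s.%s.exp%s\n' "$lang" "$fontname" "$num"
}

base=$(training_basename num test 0)
echo "$base.tif"   # name of the merged TIFF
echo "$base.box"   # name of the box file generated from it
```

With the values used in this article, the two lines printed are the TIFF and box file names that appear in the following steps.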

4. Generate a box file. A box file records a bounding box for each recognized character.

tesseract num.test.exp0.tif num.test.exp0 --psm 3 batch.nochop makebox

The num.test.exp0.box file is generated on success.

5. In jTessBoxEditor, choose Box Editor->Open and open num.test.exp0.tif. The num.test.exp0.box file is loaded together with it, so put both files in the same directory. Fill in the Char column for the unrecognized 2, and change the 6 back to 4, since the 4 was misrecognized as 6. When you are done, press Enter and Save.
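After saving, a quick sanity check of the box file can catch editing slips before training. The sketch below assumes the usual box layout of one character per line followed by left, bottom, right, top, and page number; check_box_file is a hypothetical helper, and the sample coordinates are made up rather than taken from the image in this article.

```shell
#!/bin/sh
# check_box_file is a hypothetical helper. It reports lines that do not look like
# "<char> <left> <bottom> <right> <top> <page>" and fails if any are found.
check_box_file() {
    awk '
        NF != 6 { printf "line %d: expected 6 fields, got %d\n", NR, NF; bad = 1; next }
        {
            for (i = 2; i <= 6; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field %d is not a number\n", NR, i
                    bad = 1
                }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example with made-up coordinates.
sample=$(mktemp)
printf '1 10 12 30 60 0\n2 40 12 62 60 0\n' > "$sample"
check_box_file "$sample" && echo "box file looks well-formed"
rm -f "$sample"
```

Running check_box_file num.test.exp0.box after each save is one way to use it; it only checks the layout, not whether the characters are correct.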

6. Create a file named font_properties containing the line test 0 0 0 0 0

The format of the line is <fontname> <italic> <bold> <fixed> <serif> <fraktur>. fontname is the font name; italic, bold, fixed (fixed-pitch), serif, and fraktur (German blackletter) are flags, where 1 and 0 represent yes and no.

You can also use the command:

echo test 0 0 0 0 0 > font_properties
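To keep the font name in font_properties in sync with the training file name, the line can be derived from the base name instead of typed by hand. This is a minimal sketch; font_from_basename and write_font_properties are hypothetical helpers, and they assume the font name itself contains no dots.

```shell
#!/bin/sh
# font_from_basename is a hypothetical helper. It extracts the fontname part
# from a base name of the form [lang].[fontname].exp[num].
font_from_basename() {
    printf '%s\n' "$1" | cut -d. -f2
}

# write_font_properties (also hypothetical) writes "<fontname> 0 0 0 0 0"
# to the file given as the second argument.
write_font_properties() {
    font=$(font_from_basename "$1")
    printf '%s 0 0 0 0 0\n' "$font" > "$2"
}

out=$(mktemp)
write_font_properties num.test.exp0 "$out"
cat "$out"
rm -f "$out"
```

In the training directory, write_font_properties num.test.exp0 font_properties would produce the same file as the echo command.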

Remember that the font name must be the same as the test in num.test.exp0.

7. Generate training files

tesseract num.test.exp0.tif num.test.exp0 nobatch box.train

8. Generate a character set

unicharset_extractor num.test.exp0.box

The unicharset file is generated on success

Note that if Tesseract was installed without the training tools, the shell reports:

zsh: command not found: unicharset_extractor

See another blog post: zsh: command not found: unicharset_extractor
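It can save time to check for all of the training tools up front rather than discovering a missing one partway through. The following sketch uses the standard command -v lookup; missing_tools is a hypothetical helper name.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper. It prints each command name from its
# arguments that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools tesseract unicharset_extractor mftraining cntraining combine_tessdata)
if [ -n "$missing" ]; then
    echo "Missing training tools:"
    echo "$missing"
else
    echo "All training tools found"
fi
```

The list passed to missing_tools covers the commands used in this article; any tool it reports has to be installed before continuing.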

9. Generate the character shape feature files

mftraining -F font_properties -U unicharset -O num.unicharset num.test.exp0.tr

After the command is executed successfully, the inttemp, pffmtable, shapetable, and num.unicharset files are generated.

10. Generate a character normalization feature file

cntraining num.test.exp0.tr

A normproto file is generated after the command is successfully executed.

11. Rename the files

mv normproto num.normproto
mv inttemp num.inttemp
mv pffmtable num.pffmtable
mv shapetable num.shapetable

12. Merge the training files

combine_tessdata num.

On success, num.traineddata is generated. This is the final trained model. Put it in Tesseract's tessdata directory, which may be /usr/local/Cellar/tesseract/4.4.1/share/tessdata or /usr/local/share/tessdata.
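Since the tessdata location differs between installations, a small lookup can pick whichever candidate actually exists. This is a sketch only; first_existing_dir is a hypothetical helper, and the candidate paths are the two mentioned above, which may not match your system.

```shell
#!/bin/sh
# first_existing_dir is a hypothetical helper. It prints the first argument
# that is an existing directory and fails if none of them exists.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate paths from this article; adjust them for your installation.
if tessdata=$(first_existing_dir \
        /usr/local/Cellar/tesseract/4.4.1/share/tessdata \
        /usr/local/share/tessdata); then
    echo "Copy num.traineddata to: $tessdata"
else
    echo "No tessdata directory found among the candidates"
fi
```

Once num.traineddata is in place, the new model is selected with the -l option, for example tesseract 12345.png stdout -l num.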

13. Final Results:

Conclusion:

1. Tesseract 4.1.1 could not be installed with the --with-training-tools option here, so the training tools have to be built and installed separately; see the blog post mentioned above.

2. Most of the commands in these steps can be run from a script. Only the annotation correction in jTessBoxEditor needs the graphical interface. If that correction step were wrapped in a tool, the remaining commands could run automatically after the correction, which would allow batch training.
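As a starting point for such a script, the sketch below prints the post-correction commands for a given lang, fontname, and num without executing them. print_training_commands is a hypothetical helper, and it assumes that font_properties and the corrected box file already exist in the current directory.

```shell
#!/bin/sh
# print_training_commands is a hypothetical helper. Given lang, fontname and num,
# it prints the training commands in order without executing them.
print_training_commands() {
    lang=$1
    base="$1.$2.exp$3"
    cat <<EOF
tesseract $base.tif $base nobatch box.train
unicharset_extractor $base.box
mftraining -F font_properties -U unicharset -O $lang.unicharset $base.tr
cntraining $base.tr
mv normproto $lang.normproto
mv inttemp $lang.inttemp
mv pffmtable $lang.pffmtable
mv shapetable $lang.shapetable
combine_tessdata $lang.
EOF
}

# Review the output first; pipe it to "sh -e" to actually run the commands.
print_training_commands num test 0
```

Printing rather than executing keeps the sketch safe to try; piping the output to sh -e stops at the first failing command, which matters because each step depends on the files produced by the previous one.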
