Introduction:
In this article, we will talk about how to train Tesseract to recognize handwritten digits. After reading it, you can apply the same approach to other languages such as Chinese, to other characters, or even to symbols. The method is universal.
Prerequisites:
1. Install jTessBoxEditor
2. Install Tesseract
3. A picture that is not recognized correctly (here I wrote 12345 by hand, and it was recognized as 1365 using the default eng language data)
As follows:
Results:
tesseract 12345.png stdout
1365
Training steps:
1. Open jTessBoxEditor by entering the following on the command line
java -jar jTessBoxEditor.jar
An interactive window pops up
2. Choose Tools->Merge TIFF
3. Name the merged file in the format [lang].[fontname].exp[num].tif, where lang is the language, fontname is the font name, and num is a custom number.
Here I named it num.test.exp0.tif. This name will be used repeatedly in subsequent steps.
When done, the jTessBoxEditor tool will display the following popup box and generate the num.test.exp0.tif file in the current train directory.
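Since the later commands all depend on this file name, it can help to see how it breaks down. The following is only a sketch: parse_train_name is a hypothetical helper, not part of Tesseract or jTessBoxEditor, and it assumes the name follows the [lang].[fontname].exp[num].tif pattern exactly.

```shell
# Hypothetical helper: split [lang].[fontname].exp[num].tif into its parts.
parse_train_name() {
  name=$(basename "$1" .tif)   # e.g. num.test.exp0
  lang=${name%%.*}             # text before the first dot
  rest=${name#*.}              # e.g. test.exp0
  font=${rest%%.*}             # text before the next dot
  num=${rest##*.exp}           # digits after ".exp"
  printf '%s %s %s\n' "$lang" "$font" "$num"
}

parse_train_name num.test.exp0.tif   # prints: num test 0
```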
4. Generate a box file. The box file draws a box around each recognized character
tesseract num.test.exp0.tif num.test.exp0 --psm 3 batch.nochop makebox
A num.test.exp0.box file is generated on success
5. Correct the box file. In jTessBoxEditor, choose Box Editor->Open and open num.test.exp0.tif. The num.test.exp0.box file is loaded along with it, so put both files in the same directory. Fill in the Char column for the unrecognized 2, and change the 6 back to 4, since the 4 was misrecognized as 6. When you’re done, press Enter and Save.
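After saving, a quick sanity check is to read the labels back out of the box file. This sketch assumes the usual Tesseract box layout of one character per line as `<char> <left> <bottom> <right> <top> <page>`. box_text is a hypothetical helper name.

```shell
# Hypothetical helper: print the characters of a box file in order,
# assuming each line is "<char> <left> <bottom> <right> <top> <page>".
box_text() {
  awk 'NF == 6 { printf "%s", $1 } END { print "" }' "$1"
}

# Usage (after correcting the boxes in jTessBoxEditor):
#   box_text num.test.exp0.box
# For the handwritten sample in this article the labels should read 12345.
```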
6. Create the font_properties file. Create an empty file named font_properties and fill in test 0 0 0 0 0
The format of test 0 0 0 0 0 is <fontname> <italic> <bold> <fixed> <serif> <fraktur>: fontname is the font name, italic means italic, bold means bold, fixed means fixed-width, serif means a serif font, and fraktur means the German blackletter font. 1 and 0 mean yes and no.
You can also use the command:
echo test 0 0 0 0 0 > font_properties
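A mismatch between the font name in font_properties and the one in the image name is easy to miss, so a small check can help. This is a sketch under assumptions: check_font_properties is a hypothetical helper, and it expects the image name to follow [lang].[fontname].exp[num].tif and each font_properties line to have the font name plus five 0/1 flags.

```shell
# Hypothetical helper: verify that font_properties has an entry for the
# font name embedded in the training image name.
check_font_properties() {
  img=$1
  props=${2:-font_properties}
  base=$(basename "$img" .tif)   # e.g. num.test.exp0
  rest=${base#*.}                # e.g. test.exp0
  font=${rest%%.*}               # e.g. test
  # An entry is the font name followed by five flags (six fields in total).
  if awk -v f="$font" '$1 == f && NF == 6 { found = 1 } END { exit !found }' "$props"; then
    echo "ok: $font"
  else
    echo "font_properties has no entry for: $font"
    return 1
  fi
}

# Usage:
#   check_font_properties num.test.exp0.tif font_properties
```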
Remember that the font name must be the same as the test in num.test.exp0.
7. Generate training files
tesseract num.test.exp0.tif num.test.exp0 nobatch box.train
8. Generate a character set
unicharset_extractor num.test.exp0.box
The unicharset file is generated on success
One thing to note here is that if you installed Tesseract without the training tools, you will see
zsh: command not found: unicharset_extractor
See another blog post: zsh: command not found: unicharset_extractor
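To find out up front which of the tools used in this article are missing, rather than hitting the error halfway through, something like the following can be run first. missing_tools is a hypothetical helper that only relies on the standard `command -v` lookup.

```shell
# Hypothetical helper: print every given command that is not on PATH.
missing_tools() {
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || printf '%s\n' "$t"
  done
}

# Lists the tools from this article that are not installed (prints nothing if all are present).
missing_tools tesseract unicharset_extractor mftraining cntraining combine_tessdata
```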
9. Generate the clustered character feature files
mftraining -F font_properties -U unicharset -O num.unicharset num.test.exp0.tr
After the command is executed successfully, the inttemp, pffmtable, shapetable, and num.unicharset files are generated
10. Generate a character normalization feature file
cntraining num.test.exp0.tr
A normproto file is generated after the command is successfully executed
11. Rename the file
mv normproto num.normproto
mv inttemp num.inttemp
mv pffmtable num.pffmtable
mv shapetable num.shapetable
12. Merge the training files
combine_tessdata num.
This generates num.traineddata. Yes, this is the final trained data. Put it in the Tesseract tessdata directory, which may be /usr/local/Cellar/tesseract/4.4.1/share/tessdata or /usr/local/share/tessdata
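Copying the file and trying it out might look like the sketch below. install_traineddata is a hypothetical helper, and the destination path is an assumption that depends on your installation. The new language is then selected with Tesseract's -l option.

```shell
# Hypothetical helper: copy a .traineddata file into a tessdata directory.
install_traineddata() {
  src=$1
  dest=$2
  [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
  mkdir -p "$dest" && cp "$src" "$dest/" &&
    echo "installed: $dest/$(basename "$src")"
}

# Usage (the tessdata path is an assumption, adjust it to your system):
#   install_traineddata num.traineddata /usr/local/share/tessdata
#   tesseract 12345.png stdout -l num
```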
13. Final Results:
Conclusion:
1. Tesseract 4.1.1 cannot be installed with --with-training-tools, so you need to build and install the training tools yourself; see the blog post mentioned above.
2. In fact, most of the commands in these steps can be run from a script. Only the jTessBoxEditor annotation correction has to be done in the interface. If the jTessBoxEditor correction step is wrapped into a tool, the remaining commands can run automatically after the correction, and batch training becomes possible.
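As a rough sketch of that idea, the commands from step 7 through step 12 can be chained in one shell function that is run once the box file has been corrected and font_properties exists. The names run and train and the DRY_RUN variable are hypothetical. With DRY_RUN=1 the commands are only printed, which is a cheap way to review them before a real run.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# train LANG FONT [NUM]: steps 7 to 12 for LANG.FONT.expNUM.tif/.box.
# Assumes the corrected box file and font_properties are in the current directory.
train() {
  lang=$1
  font=$2
  num=${3:-0}
  base="$lang.$font.exp$num"
  run tesseract "$base.tif" "$base" nobatch box.train &&
  run unicharset_extractor "$base.box" &&
  run mftraining -F font_properties -U unicharset -O "$lang.unicharset" "$base.tr" &&
  run cntraining "$base.tr" &&
  for f in normproto inttemp pffmtable shapetable; do
    # Prefix each generated file with the language name.
    run mv "$f" "$lang.$f" || return 1
  done &&
  run combine_tessdata "$lang."
}

# Usage:
#   DRY_RUN=1; train num test 0   # print the commands only
#   DRY_RUN=0; train num test 0   # actually run them
```

Each step is joined with &&, so the function stops at the first command that fails instead of merging incomplete files.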