Preface
Tesseract-OCR is a free, open-source OCR (optical character recognition) engine. It was originally developed by HP Labs, later released as open source, and then improved, debugged, optimized, and re-released by Google. From the command line, it converts the text in an image into machine-readable text. It currently supports more than 60 languages, including Simplified Chinese, Traditional Chinese, English, Japanese, and Korean.
Environment preparation
1. Install Tesseract 4.1
- Download: digi.bib.uni-mannheim.de/tesseract/
Go to the download page and select Tesseract-Ocr-w64-setup-v4.1.0.201903.exe
- During installation, keep clicking "Next". The installation path can be customized, and the additional language data does not need to be checked, because the best models are downloaded from GitHub in the next step.
2. Download tessdata_best
What is tessdata_best?
tessdata_best is the set of most accurate trained models for Tesseract's LSTM engine.
- Download: github.com/tesseract-o…
- Unzip the archive and copy the following four files into the Tesseract-OCR\tessdata directory:
- chi_sim.traineddata
- chi_sim_vert.traineddata
- chi_tra.traineddata
- chi_tra_vert.traineddata
3. Install pytesseract
To use Tesseract from Python code, install pytesseract with pip (pip install pytesseract).
- The test code is as follows:
from PIL import Image
import pytesseract

# Recognize Simplified Chinese text in the image and print it
text = pytesseract.image_to_string(Image.open(r'D:\train\1.jpg'), lang='chi_sim')
print(text)
- Errors that may be encountered when using pytesseract
raise TesseractNotFoundError()
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path
This error means that Tesseract is not installed or is not on the PATH.
Solution: after pytesseract is installed, a pytesseract folder is created in the site-packages directory under Python's Lib folder. Open pytesseract.py in that folder and find the following code:
tesseract_cmd = 'tesseract'
Change it to the absolute path of the installed executable, written as a raw string so that the backslashes are not treated as escape sequences:
tesseract_cmd = r'D:\Tesseract-OCR\tesseract.exe'
tesseract_cmd must hold the absolute path where you installed Tesseract, so that pytesseract can find the executable.
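As an alternative to editing the installed library file, the path can be resolved in your own script. The following is a minimal sketch using only the standard library; the helper name find_tesseract and the candidate path are assumptions for illustration, not part of pytesseract.

```python
import os
import shutil


def find_tesseract(candidates=(), program="tesseract"):
    """Return the first usable path to the Tesseract executable, or None.

    `candidates` are explicit install locations to try first (hypothetical
    examples; adjust them to your machine). PATH is searched afterwards.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    # shutil.which returns None when the program is not on PATH.
    return shutil.which(program)


if __name__ == "__main__":
    cmd = find_tesseract([r"D:\Tesseract-OCR\tesseract.exe"])
    if cmd is None:
        print("tesseract not found; install it or add it to PATH")
    else:
        print("using", cmd)
        # With pytesseract installed, the path can then be set at runtime:
        #   import pytesseract
        #   pytesseract.pytesseract.tesseract_cmd = cmd
```

Setting the path at runtime keeps the change in your own code, so it is not lost when pytesseract is upgraded or reinstalled.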
4. Configure environment variables
Step 1: Add the Tesseract installation path to the Path system environment variable
Step 2: Add the tessdata directory (the trained language data) as a system variable
The variable name is TESSDATA_PREFIX
The variable value is the path to the tessdata directory
After the configuration is complete, open a command prompt and enter tesseract -v. If the version information is printed, the environment variables are configured correctly.
Basic commands:
tesseract --version displays the Tesseract version
tesseract --list-langs lists the languages currently available to Tesseract
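The environment setup can also be checked from Python. This sketch assumes that TESSDATA_PREFIX points at the tessdata directory, as configured above, and that the four files from step 2 are the ones you need; the function name missing_traineddata is hypothetical.

```python
import os


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes whose .traineddata file is absent."""
    return [
        lang
        for lang in langs
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]


if __name__ == "__main__":
    tessdata_dir = os.environ.get("TESSDATA_PREFIX")
    if not tessdata_dir:
        print("TESSDATA_PREFIX is not set")
    else:
        wanted = ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"]
        print("missing:", missing_traineddata(tessdata_dir, wanted))
```

An empty list means that every requested language file is in place.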
5. Download jTessBoxEditor
Why download jTessBoxEditor?
The training workflow below uses jTessBoxEditor to merge sample images into a TIF file and to edit box files. The software requires a Java runtime environment (JRE), so install one and configure its environment variables first.
Download address: sourceforge.net/projects/vi…
Go to the download page and select "jTessBoxEditor-2.1.1.zip"
After unzipping, double-click jTessBoxEditor.jar to start it.
Simple conversion
1. Conversion process
Prepare an image file, such as 1.jpg (several image formats are supported).
In the command line, switch to the directory that contains the image, for example D:\train, and then enter:
tesseract 1.jpg test -l chi_sim --psm 7
Open the generated test.txt file to view the output.
2. Command interpretation
tesseract imagename outputbase -l lang --psm pagesegmode
- -l chi_sim selects the Simplified Chinese language data. (Download the language data file, unzip it, and save it to the tessdata directory. Language data files use the .traineddata extension; the Simplified Chinese file is named chi_sim.traineddata.)
- --psm 7 tells Tesseract to treat the image as a single line of text, which reduces recognition errors for such images. The default is 3.
Page segmentation modes (--psm):
- 0: Orientation and script detection (OSD) only.
- 1: Automatic page segmentation with OSD.
- 2: Automatic page segmentation, but no OSD or OCR. (not implemented)
- 3: Fully automatic page segmentation, but no OSD. (Default)
- 4: Assume a single column of text of variable sizes.
- 5: Assume a single uniform block of vertically aligned text.
- 6: Assume a single uniform block of text.
- 7: Treat the image as a single text line.
- 8: Treat the image as a single word.
- 9: Treat the image as a single word in a circle.
- 10: Treat the image as a single character.
- 11: Sparse text. Find as much text as possible in no particular order.
- 12: Sparse text with OSD.
- 13: Raw line. Treat the image as a single text line.
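The command layout described above can be assembled programmatically. This is a sketch only: build_ocr_command is a hypothetical helper, and it just builds the argument list using the -l and --psm options from this article; passing the list to subprocess.run would execute it if tesseract is on the PATH.

```python
def build_ocr_command(image, output_base, lang="chi_sim", psm=3):
    """Build the tesseract argument list: image, output base, -l, --psm."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13, got %r" % (psm,))
    return ["tesseract", image, output_base, "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    cmd = build_ocr_command("1.jpg", "test", psm=7)
    # Prints: tesseract 1.jpg test -l chi_sim --psm 7
    print(" ".join(cmd))
```

Keeping the arguments as a list avoids shell quoting problems when file names contain spaces.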
LSTM training process
Introduction
Tesseract 4 includes a new neural-network-based recognition engine that recognizes document images significantly more accurately than previous versions. However, the Chinese language data chi_sim has a low accuracy rate on handwritten Chinese and on images with complex backgrounds, so training with your own samples is needed to improve recognition in those cases. Training also lets you build your own language data.
The training process
1. Generate a TIF file
- Use jTessBoxEditor to merge multiple sample images into a single TIF file
To simplify the following steps, name the merged file using the pattern [lang].[fontname].exp[num].tif:
- lang is the language name
- fontname is the font name
- num is the sequence number
Here the TIF file is saved as nml.num.exp3.tif
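The naming pattern above can be sketched in code as follows. Both function names are hypothetical, and the example uses the nml.num.exp3 name that appears in the commands of this article.

```python
import re

# [lang].[fontname].exp[num].tif, with no dots inside lang or fontname
NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$")


def training_name(lang, fontname, num, ext="tif"):
    """Compose a training file name such as nml.num.exp3.tif."""
    return "%s.%s.exp%d.%s" % (lang, fontname, num, ext)


def parse_training_name(filename):
    """Split a training TIF name into (lang, fontname, num)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError("not a [lang].[fontname].exp[num].tif name: " + filename)
    return match.group("lang"), match.group("fontname"), int(match.group("num"))


if __name__ == "__main__":
    print(training_name("nml", "num", 3))
    print(parse_training_name("nml.num.exp3.tif"))
```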
2. Generate tif image box file
tesseract nml.num.exp3.tif nml.num.exp3 -l chi_sim batch.nochop makebox
Running the above command will generate a file named nml.num.exp3.box in our folder.
3. Add and modify box files
Open jTessBoxEditor, click the Box Editor tab, and load the TIF file to correct the boxes (the box file and the TIF file must be in the same folder).
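Before or after manual editing, a box file can also be inspected with a short script. This sketch assumes the common box file layout of one symbol per line followed by left, bottom, right, top, and page number, separated by spaces; the Box type and both helpers are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one line of the form: symbol left bottom right top page."""
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("unexpected box line: %r" % (line,))
    symbol = parts[0]
    left, bottom, right, top, page = (int(value) for value in parts[1:])
    return Box(symbol, left, bottom, right, top, page)


def suspicious_boxes(boxes):
    """Return boxes whose width or height is zero or negative."""
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]


if __name__ == "__main__":
    sample = ["中 10 20 40 60 0", "文 45 20 45 60 0"]
    boxes = [parse_box_line(line) for line in sample]
    print(suspicious_boxes(boxes))
```

A check like this only flags degenerate rectangles; whether each symbol is correct still has to be verified by eye in the Box Editor.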
4. Generate an LSTMF file
tesseract nml.num.exp3.tif nml.num.exp3 -l chi_sim --psm 6 lstm.train
Running the above command will generate a file named nml.num.exp3.lstmf in our folder.
5. Extract the LSTM file of the language
Copy the chi_sim.traineddata file downloaded from tessdata_best in step 2 of the environment preparation into the training folder, and then extract the .lstm file from it:
combine_tessdata -e chi_sim.traineddata chisim.lstm
Running the above command will generate a file named chisim.lstm in our folder.
6. Start LSTM training
Note: first create a text file named chitraing.txt that contains the absolute path of the .lstmf file generated in step 4. Then run:
lstmtraining --model_output="D:\Program Files (x86)\train1\output" --continue_from="D:\Program Files (x86)\train1\chisim.lstm" --train_listfile="D:\Program Files (x86)\train1\chitraing.txt" --traineddata="D:\Program Files (x86)\train1\chi_sim.traineddata"
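The list file passed to --train_listfile can be generated instead of typed by hand. This sketch assumes that all .lstmf files sit directly in the training folder and that the list file holds one absolute path per line; write_train_listfile is a hypothetical helper, and the default file name is the one used in this article.

```python
import glob
import os


def write_train_listfile(train_dir, listfile_name="chitraing.txt"):
    """Write the absolute paths of all .lstmf files in train_dir, one per line.

    Returns the list of paths that were written.
    """
    pattern = os.path.join(os.path.abspath(train_dir), "*.lstmf")
    paths = sorted(glob.glob(pattern))
    listfile = os.path.join(train_dir, listfile_name)
    with open(listfile, "w", encoding="utf-8", newline="\n") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

For example, calling write_train_listfile(r"D:\Program Files (x86)\train1") would write chitraing.txt into that folder, which is useful once there is more than one .lstmf file.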
7. Generate the new traineddata file
lstmtraining --stop_training --continue_from="D:\Program Files (x86)\train1\output_checkpoint" --traineddata="D:\Program Files (x86)\train1\chi_sim.traineddata" --model_output="D:\Program Files (x86)\train1\test.traineddata"
8. Move the new traineddata file
A file named test.traineddata is generated in the training folder. Copy it to the tessdata folder of Tesseract-OCR, and it can then be used as a recognition language, for example with -l test.
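Steps 5 to 7 can be collected into one script so that the long paths are written only once. The sketch below only builds the argument lists, using the same file names and options as the commands in this article; training_commands is a hypothetical helper, and each list could be passed to subprocess.run(cmd, check=True) to actually run it.

```python
import os


def training_commands(train_dir, base_lang="chi_sim", new_lang="test"):
    """Return the argument lists for extracting, training, and finalizing."""

    def in_dir(name):
        return os.path.join(train_dir, name)

    traineddata = in_dir(base_lang + ".traineddata")
    extract = ["combine_tessdata", "-e", traineddata, in_dir("chisim.lstm")]
    train = [
        "lstmtraining",
        "--model_output=" + in_dir("output"),
        "--continue_from=" + in_dir("chisim.lstm"),
        "--train_listfile=" + in_dir("chitraing.txt"),
        "--traineddata=" + traineddata,
    ]
    finish = [
        "lstmtraining",
        "--stop_training",
        "--continue_from=" + in_dir("output_checkpoint"),
        "--traineddata=" + traineddata,
        "--model_output=" + in_dir(new_lang + ".traineddata"),
    ]
    return [extract, train, finish]


if __name__ == "__main__":
    for cmd in training_commands(r"D:\Program Files (x86)\train1"):
        print(cmd)
```

Because each command is a list rather than a single string, paths containing spaces such as "Program Files (x86)" need no extra quoting.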