- Scotland team
- Author: MaDaima
Preface
Some time ago, we found that being able to recognize the text in certain standardized images would improve our user experience quite a bit, so I started looking into image text recognition.
The first step was to look at the APIs of the major vendors. Their recognition rates are naturally very high and the results are good, but the problem is that they limit the number of calls, the model is not under your own control, and you have to pay to get more. (See, for example: Java image text recognition based on the Baidu API (supports Chinese, English and mixed text) – TheLoveFromWSK – CSDN blog.)
So I looked into OCR and Tesseract myself and put together this Tesseract hello world. Experts, please go easy on me.
Introduction
OCR: Optical Character Recognition (OCR) refers to the process of analyzing and recognizing image files that contain text, and extracting the text and layout information from them.
Tesseract-ocr: Tesseract is an OCR engine developed by Ray Smith at HP's Bristol Lab between 1985 and 1995. It ranked among the top engines in the 1995 UNLV accuracy test, but development basically halted after 1996. In 2006, Google brought Smith on board to revive the project. The project is currently licensed under Apache 2.0 and supports mainstream platforms such as Windows, Linux and macOS, but as an engine it only provides command-line tools.
Tesseract installation
GitHub – tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository); tesseract-ocr/tesseract Wiki · GitHub
Windows
1. After the download is complete, run the exe installer directly and install it to a folder of your choice.
2. Configure environment variables.
Add the Tesseract-OCR installation directory to the Path environment variable.
3. Add the TESSDATA_PREFIX system variable
Create a TESSDATA_PREFIX system variable whose value is the tessdata folder under the Tesseract-OCR installation directory.
4. Open cmd and check whether the installation succeeded:
tesseract -v
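If you prefer the command line over the system environment-variable dialog, something like the following should work (a rough sketch; the install path C:\Program Files\Tesseract-OCR is an assumption, adjust it to wherever you installed Tesseract, and on some Tesseract versions TESSDATA_PREFIX should point to the install directory itself rather than its tessdata subfolder):
rem assumes Tesseract was installed to C:\Program Files\Tesseract-OCR
setx TESSDATA_PREFIX "C:\Program Files\Tesseract-OCR\tessdata"
rem note: adding the install directory to Path is safer via the GUI, since setx can truncate long Path values
rem open a new cmd window so the new variable is picked up, then verify
tesseract -v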
Using Tesseract
1. Prepare an image. I'll use PNG; JPG also works (I won't list every supported format here).
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
2. Run: tesseract [image path + name + extension] [output path + name]
3. This creates a DemoAnswer.txt file in the output directory (Tesseract appends the .txt extension to the output name).
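For example, using the HelloWorldDemo.png image that appears later in this post, with DemoAnswer as the output name (both names are just the ones used in this demo):
tesseract HelloWorldDemo.png DemoAnswer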
But what if the text is in Chinese? Repeating the steps above isn't enough.
4. Chinese recognition
tesseract imagename outputbase [-l lang]
The [-l lang] option in this command sets the recognition language. The default is English, so to recognize Chinese you also need to download the Chinese language data for tesseract-ocr:
Data Files · tesseract-ocr/tesseract Wiki · GitHub
Place the downloaded language data (the chi_sim .traineddata file) in the tessdata directory.
4.2 Run the command
tesseract HelloWorldDemo.png DemoAnswer -l chi_sim
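If Tesseract instead complains that it cannot find chi_sim, a quick way to check which language packs it can actually see (assuming TESSDATA_PREFIX is set as described above) is:
rem lists every language whose .traineddata file Tesseract can find
tesseract --list-langs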
The result is still not ideal. This is because the language data was trained on fonts different from the one in our image, so if we want to recognize the font in our own pictures we need to train our own language data.
Training a model
Steps (from Tesseract's GitHub):
Prepare training text.
Render text to image + box file (or create hand-made box files for existing image data).
Make unicharset file.
Optionally make dictionary data.
Run tesseract to process image + box file to make training data set.
Run training on training data set.
Combine data files.
1. Change the name of the image to [lang].[fontName].exp[num].tif.
For example: BlackLang.HelloWorldDemo.exp0.tif
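For instance, assuming the training image has already been saved as a TIFF (the demo image earlier in this post was a PNG; any image editor can save a copy as TIFF), the rename on Windows would look like this:
rem HelloWorldDemo.tif is a hypothetical source file name, adjust to your own image
ren HelloWorldDemo.tif BlackLang.HelloWorldDemo.exp0.tif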
2. Generate the Box file
The command (taken from the official site) is:
tesseract BlackLang.HelloWorldDemo.exp0.tif BlackLang.HelloWorldDemo.exp0 -l chi_sim batch.nochop makebox
This will generate a .box file in the target directory.
3. Use jTessBoxEditor
jTessBoxEditor: VietOCR – Browse /jTessBoxEditor at SourceForge.net
3.1 Once installed, open jTessBoxEditor, click Open, and select the corresponding tif (the tif and box files need to be in the same directory).
3.2 In the Box View you can see an enlarged view of each character, so you can check whether each one is boxed correctly.
4. Training
4.1 Generating the tr file
tesseract.exe [tif image file name] [output tr file name] nobatch box.train
tesseract.exe BlackLang.HelloWorldDemo.exp0.tif BlackLang.HelloWorldDemo.exp0 nobatch box.train
4.2 Generating the character set
Enter the command: unicharset_extractor.exe [box file name]
unicharset_extractor.exe BlackLang.HelloWorldDemo.exp0.box
If you have multiple images for the same font, you need to combine all of their box files into a single character set:
unicharset_extractor.exe [first box file name] [second box file name] ... (this demo uses only one box file)
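For example, if there were a second training image (the exp1 box file below is hypothetical), the call would look like this:
unicharset_extractor.exe BlackLang.HelloWorldDemo.exp0.box BlackLang.HelloWorldDemo.exp1.box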
4.3 Defining the font feature file font_properties
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
Here fontname must match the [fontName] part of [lang].[fontName].exp[num].box. The values of <italic>, <bold>, <fixed>, <serif> and <fraktur> are each 1 or 0, indicating whether the font has that property.
Here we create a file in the directory named font_properties with the following contents:
HelloWorldDemo 0 0 0 0 0
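If you would rather create the file from cmd than from a text editor, this should do it (the redirection is placed first so that cmd does not treat the trailing "0>" as a file-handle redirect):
rem writes the single line "HelloWorldDemo 0 0 0 0 0" into font_properties
>font_properties echo HelloWorldDemo 0 0 0 0 0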
At this point our directory contains the following files: the target image (.tif), the box file (.box), the training result (.tr), the character set (unicharset), and the feature file (font_properties).
4.4 Generating a Dictionary
mftraining.exe -F font_properties -U unicharset -O BlackLang.unicharset BlackLang.HelloWorldDemo.exp0.tr
(If you have multiple .tr files, list all of them at the end of the command.)
cntraining.exe BlackLang.HelloWorldDemo.exp0.tr
(The same applies here: multiple .tr files can be listed.)
If you are interested, you can open the generated files in an editor and see what is inside; you will find some interesting gibberish in there, haha.
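One caveat from the Tesseract 3.x training docs (treat the exact file names as an assumption for your version): mftraining writes its results to inttemp, pffmtable and shapetable, and cntraining writes normproto, but combine_tessdata in the next step only bundles files whose names start with the language prefix, so the outputs usually need to be renamed first. A minimal sketch:
rem rename the clustering outputs so combine_tessdata can pick them up
ren inttemp BlackLang.inttemp
ren pffmtable BlackLang.pffmtable
ren shapetable BlackLang.shapetable
ren normproto BlackLang.normproto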
4.5 Run the following command to merge files:
combine_tessdata BlackLang.
#combine_tessdata [lang].
Then copy [lang].traineddata (here BlackLang.traineddata) into the tessdata directory under Tesseract-OCR and run:
tesseract HelloWorldDemo.png answer -l BlackLang
The end
This was only a toy training run, nowhere near enough for real production use; proper training needs a lot of data covering all the characters of a given font. Generally speaking, Tesseract's recognition of Chinese is still quite poor, let alone handwritten Chinese, where the recognition rate is basically zero. Its accuracy on English text and digits, however, is quite impressive. This is just a simple record and share of my experiments with Tesseract.
There are also a lot of packaged OCR options out there, such as PyOCR, Tess4J, and the APIs provided by the major vendors, which achieve very high recognition rates because their models are already well trained, but I'll leave those for another time.