OCR
Optical Character Recognition (OCR) is the process of analyzing images of text and extracting the text together with its layout information.
Tesseract
Tesseract is an OCR engine developed by Ray Smith at HP's Bristol Laboratory between 1985 and 1995, and it was one of the top three engines in the 1995 UNLV accuracy test. However, development essentially stopped after 1996. In 2006, Google brought Smith on board to revive the project. The project is currently licensed under Apache 2.0 and supports mainstream platforms such as Windows, Linux and macOS. As an engine, it ships with a command-line tool but no graphical interface. Tesseract, currently maintained by Google, is one of the best open-source OCR engines and supports Chinese.
tess-two is a port of Tesseract for Android.
Add tess-two as a Gradle dependency:
compile 'com.rmtheis:tess-two:8.0.0'
Then put the trained eng.traineddata file under the assets folder of the Android project (the copy code below reads it from assets/tessdata), and you can recognize English.
1. Recognizing English
Initialize tess-two and load the trained tessdata
private void prepareTesseract() {
    try {
        // Make sure the tessdata directory exists before copying files into it
        prepareDirectory(DATA_PATH + TESSDATA);
    } catch (Exception e) {
        e.printStackTrace();
    }
    copyTessDataFiles(TESSDATA);
}
/**
 * Prepare directory on external storage
 *
 * @param path directory to create if it does not exist yet
 */
private void prepareDirectory(String path) {
    File dir = new File(path);
    if (dir.exists()) {
        // Nothing to do, the directory is already there
        return;
    }
    if (dir.mkdirs()) {
        Log.i(TAG, "Created directory " + path);
    } else {
        Log.e(TAG, "ERROR: Creation of directory " + path + " failed; check that the Android manifest grants permission to write to external storage.");
    }
}

/**
* Copy tessdata files (located on assets/tessdata) to destination directory
*
* @param path - name of directory with .traineddata files
*/
private void copyTessDataFiles(String path) {
    try {
        String[] fileList = getAssets().list(path);
        if (fileList == null) {
            return;
        }
        for (String fileName : fileList) {
            // Copy each asset to the data directory unless it is already there
            String pathToDataFile = DATA_PATH + path + "/" + fileName;
            if (new File(pathToDataFile).exists()) {
                continue;
            }
            // try-with-resources closes both streams even if the copy fails
            try (InputStream in = getAssets().open(path + "/" + fileName);
                 OutputStream out = new FileOutputStream(pathToDataFile)) {
                // Transfer bytes from in to out
                byte[] buf = new byte[1024];
                int len;
                while ((len = in.read(buf)) > 0) {
                    out.write(buf, 0, len);
                }
            }
            Log.d(TAG, "Copied " + fileName + " to tessdata");
        }
    } catch (IOException e) {
        Log.e(TAG, "Unable to copy files to tessdata " + e.toString());
    }
}
Once the photo has been taken, call the startOCR() method.
private void startOCR(Uri imgUri) {
    try {
        BitmapFactory.Options options = new BitmapFactory.Options();
        // inSampleSize = 1 decodes at full size; 4 decodes at 1/4 of the width and height
        // (1/16 of the pixels). Smaller values need more heap memory.
        options.inSampleSize = 4;
        Bitmap bitmap = BitmapFactory.decodeFile(imgUri.getPath(), options);
        String result = extractText(bitmap);
        resultView.setText(result);
    } catch (Exception e) {
        // Pass the exception itself: getMessage() may be null, which Log.e rejects
        Log.e(TAG, "OCR failed", e);
    }
}
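The startOCR() method above hard-codes inSampleSize = 4. As a minimal sketch, the value could instead be derived from the photo's real dimensions and a target size, using the power-of-two pattern commonly paired with BitmapFactory. The class name SampleSizes, the method name and the target dimensions are hypothetical and not part of the original project. On Android, the source dimensions would typically come from a first decode with options.inJustDecodeBounds = true, reading options.outWidth and options.outHeight.

```java
// Hypothetical helper, not part of the original project.
final class SampleSizes {
    private SampleSizes() {}

    /**
     * Returns the largest power-of-two sample size for which the decoded
     * image is still at least reqWidth x reqHeight pixels.
     */
    static int calculateInSampleSize(int srcWidth, int srcHeight, int reqWidth, int reqHeight) {
        if (srcWidth <= 0 || srcHeight <= 0 || reqWidth <= 0 || reqHeight <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        int sampleSize = 1;
        // Keep doubling while the next step would still satisfy both targets
        while (srcWidth / (sampleSize * 2) >= reqWidth
                && srcHeight / (sampleSize * 2) >= reqHeight) {
            sampleSize *= 2;
        }
        return sampleSize;
    }

    public static void main(String[] args) {
        // Example: a 4000x3000 photo that should stay at least 1000x750 after decoding
        System.out.println(calculateInSampleSize(4000, 3000, 1000, 750));
    }
}
```

Keeping more pixels generally helps recognition of small text, while a larger sample size saves memory, so the target size is a trade-off to tune per device and per kind of document.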
extractText() calls the tess-two API to perform the actual text recognition.
private String extractText(Bitmap bitmap) {
    String extractedText = "empty result";
    try {
        tessBaseApi = new TessBaseAPI();
    } catch (Exception e) {
        // Without an engine instance there is nothing to run, so bail out early
        Log.e(TAG, "Could not create TessBaseAPI", e);
        return extractedText;
    }
    // DATA_PATH must be the parent directory of the tessdata folder
    tessBaseApi.init(DATA_PATH, lang);
    tessBaseApi.setImage(bitmap);
    try {
        extractedText = tessBaseApi.getUTF8Text();
    } catch (Exception e) {
        Log.e(TAG, "Error in recognizing text.", e);
    }
    // Release the native resources held by the engine
    tessBaseApi.end();
    return extractedText;
}
Finally, here is the recognition result. At this point the result is acceptable.
2. Recognizing code
Next, try to recognize a piece of source code with the program above.
This time the result is a mess. Let's rework startOCR() to add local (adaptive) binarization.
private void startOCR(Uri imgUri) {
    try {
        BitmapFactory.Options options = new BitmapFactory.Options();
        // inSampleSize = 1 decodes at full size; 4 decodes at 1/4 of the width and height
        // (1/16 of the pixels). Smaller values need more heap memory.
        options.inSampleSize = 4;
        Bitmap bitmap = BitmapFactory.decodeFile(imgUri.getPath(), options);
        // Convert to grayscale, then binarize with a locally computed threshold
        CV4JImage cv4JImage = new CV4JImage(bitmap);
        Threshold threshold = new Threshold();
        threshold.adaptiveThresh((ByteProcessor)(cv4JImage.convert2Gray().getProcessor()), Threshold.ADAPTIVE_C_MEANS_THRESH, 12, 30, Threshold.METHOD_THRESH_BINARY);
        Bitmap newBitmap = cv4JImage.getProcessor().getImage().toBitmap(Bitmap.Config.ARGB_8888);
        // Show the binarized image, then run OCR on it
        ivImage2.setImageBitmap(newBitmap);
        String result = extractText(newBitmap);
        resultView.setText(result);
    } catch (Exception e) {
        // Pass the exception itself: getMessage() may be null, which Log.e rejects
        Log.e(TAG, "OCR failed", e);
    }
}
Here, cv4j is used to binarize the image:
CV4JImage cv4JImage = new CV4JImage(bitmap);
Threshold threshold = new Threshold();
threshold.adaptiveThresh((ByteProcessor)(cv4JImage.convert2Gray().getProcessor()), Threshold.ADAPTIVE_C_MEANS_THRESH, 12, 30, Threshold.METHOD_THRESH_BINARY);
Bitmap newBitmap = cv4JImage.getProcessor().getImage().toBitmap(Bitmap.Config.ARGB_8888);
Image binarization sets the gray value of every pixel to either 0 or 255, so the whole image becomes pure black and white. Binarization simplifies the image and reduces the amount of data, which helps later processing and makes the contours of the objects of interest stand out.
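To make the idea concrete, here is a minimal sketch of two binarization strategies on a plain int[][] grayscale array: a single global threshold, and a local (adaptive) threshold that compares each pixel with the mean of its neighborhood. This is not cv4j's implementation. The class Binarizer, its method names, and the radius and offset parameters are assumptions chosen for illustration.

```java
// Illustrative only: a plain-Java sketch, not the cv4j implementation.
final class Binarizer {
    private Binarizer() {}

    /** Global thresholding: pixels brighter than the threshold become 255, all others 0. */
    static int[][] globalThreshold(int[][] gray, int threshold) {
        int height = gray.length;
        int width = gray[0].length;
        int[][] out = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                out[y][x] = gray[y][x] > threshold ? 255 : 0;
            }
        }
        return out;
    }

    /**
     * Local mean thresholding: a pixel becomes 255 when it is brighter than
     * (mean of its neighborhood - offset), otherwise 0. The neighborhood is a
     * square of side 2 * radius + 1, clipped at the image borders.
     */
    static int[][] adaptiveMeanThreshold(int[][] gray, int radius, int offset) {
        int height = gray.length;
        int width = gray[0].length;
        int[][] out = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int sum = 0;
                int count = 0;
                for (int ny = Math.max(0, y - radius); ny <= Math.min(height - 1, y + radius); ny++) {
                    for (int nx = Math.max(0, x - radius); nx <= Math.min(width - 1, x + radius); nx++) {
                        sum += gray[ny][nx];
                        count++;
                    }
                }
                int localMean = sum / count; // integer mean of the clipped window
                out[y][x] = gray[y][x] > localMean - offset ? 255 : 0;
            }
        }
        return out;
    }
}
```

The local variant tends to cope better with uneven lighting, because a dark stroke on a bright patch and a dark stroke on a dim patch are each compared against their own surroundings rather than against one image-wide value.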
cv4j GitHub address: github.com/imageproces…
cv4j is an image-processing library developed by Hyperyfish and me, implemented in pure Java.
Try again. The middle part of the image shows the binarized result, and the content of the code is now mostly recognized.
3. Recognizing Chinese
To recognize Chinese text, you need the Chinese language data. You can download it from the address below.
Github.com/tesseract-o…
The Chinese language data files are chi_sim.traineddata and chi_tra.traineddata, for Simplified Chinese and Traditional Chinese respectively.
tessBaseApi.init(DATA_PATH, lang);
The previous examples were for English, so the lang value was "eng". To recognize Simplified Chinese, change the value to "chi_sim".
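Since init() can only work when the matching .traineddata file sits in the tessdata folder under the data path, a small pre-flight check can make a missing language obvious. The sketch below is a hypothetical helper (TessDataCheck and missingLanguages are not part of tess-two or of the original project). It assumes the same layout as the code above, a tessdata directory directly under the data directory, and it accepts Tesseract's convention of joining several languages with "+", as in "eng+chi_sim".

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tess-two or the original project.
final class TessDataCheck {
    private TessDataCheck() {}

    /**
     * Returns the language codes from langSpec (for example "eng+chi_sim")
     * that have no <code>.traineddata file in dataDir/tessdata.
     */
    static List<String> missingLanguages(File dataDir, String langSpec) {
        File tessdata = new File(dataDir, "tessdata");
        List<String> missing = new ArrayList<>();
        for (String part : langSpec.split("\\+")) {
            String code = part.trim();
            if (code.isEmpty()) {
                continue; // tolerate stray separators such as "eng+"
            }
            if (!new File(tessdata, code + ".traineddata").isFile()) {
                missing.add(code);
            }
        }
        return missing;
    }
}
```

Called as missingLanguages(new File(DATA_PATH), lang) before init(), an empty list would mean every requested language file is in place, and a non-empty one could be logged instead of letting recognition fail later.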
Conclusion
This project is only a demo-level demonstration and is far from ready for a production environment. GitHub address of this project: github.com/fengzhizi71…
Why is it demo level?
- The language data is very large. The Chinese data in particular is about 50 MB, which is clearly unsuitable for shipping in a mobile app. The usual approach is to run recognition in the cloud.
- Text recognition is slow, especially for Chinese, and there is a lot of room for engineering improvement.
- A lot of pre-processing should happen before OCR. This example only uses binarization, but there are many other steps, such as tilt correction and character cutting.
- To improve the recognition rate of tess-two, you can train it on your own data sets.
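As an illustration of one pre-processing step mentioned in the list above, character cutting, here is a minimal sketch based on a vertical projection: scan the binarized image column by column and treat each run of columns containing ink as one candidate character. The class CharacterCutter and its method are hypothetical, and the sketch assumes ink pixels are 0 and background pixels are 255, with text already in a single horizontal line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration, not part of the original project.
final class CharacterCutter {
    private CharacterCutter() {}

    /**
     * Splits a binarized text line into column ranges that contain ink.
     * Each result is {startColumn, endColumnExclusive}.
     */
    static List<int[]> cutColumns(int[][] binary) {
        int height = binary.length;
        int width = binary[0].length;
        List<int[]> segments = new ArrayList<>();
        int start = -1; // -1 means we are currently in a gap between characters
        for (int x = 0; x < width; x++) {
            boolean hasInk = false;
            for (int y = 0; y < height; y++) {
                if (binary[y][x] == 0) {
                    hasInk = true;
                    break;
                }
            }
            if (hasInk && start < 0) {
                start = x; // a new run of ink columns begins
            } else if (!hasInk && start >= 0) {
                segments.add(new int[]{start, x}); // the run ended at the previous column
                start = -1;
            }
        }
        if (start >= 0) {
            segments.add(new int[]{start, width}); // run reaches the right edge
        }
        return segments;
    }
}
```

This simple approach suits well-separated printed characters. Touching characters, or Chinese characters built from left and right components with a gap between them, may be merged or over-split, so a production pipeline would likely need extra heuristics on segment widths.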