Abstract
Extracting text from images with Java and OCR. On an e-commerce platform, we may need to review merchants' business licenses and ID cards and extract the numbers on those certificates. While there are only a few merchants, the review can be done manually and the numbers copied out by hand; as the number of merchants grows, manual review becomes labor-intensive. At that point it is natural to extract the text from the images automatically. Let's take a look at OCR image text extraction.
Spring Boot hands-on e-commerce project mall4j: gitee.com/gz-yami/mal…
A first look at image text extraction with Tesseract OCR
You can call a third-party API for image text extraction, such as the Baidu API.
Here we do the text extraction locally, using Java + Tesseract.
Environment: Windows 10, JDK 1.8, IDEA, Maven
Before installing Tesseract 3.02, make sure the Java environment (JDK) is already set up.
1. Download and install Tesseract3.02
2. Download chi_sim.traineddata
Baidu Cloud link: pan.baidu.com/s/1VYx6zbKA… (extraction code: HSR7)
3. Place chi_sim.traineddata in the tessdata directory
4. Open the digits file under \tessdata\configs and change the line "tessedit_char_whitelist 0123456789-." to "tessedit_char_whitelist 0123456789x", so that the x that can end an ID card number is allowed.
5. Configure the environment variables (add the Tesseract 3.02 installation directory)
6. Check whether the installation succeeded
Press Win+R and open CMD
Enter tesseract
If the following usage text is printed, the installation is complete
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile…]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
Extracting text
tesseract C:/pic/1.png C:/pic/1 -l chi_sim
This reads the image C:/pic/1.png and generates C:/pic/1.txt, which contains the extracted text.
-l: this is the lowercase letter L, not the digit 1.
chi_sim: the Simplified Chinese language data
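The argument order in the usage text above (image, output base, then -l and -psm, then any config file) can be captured in a small helper. The following is only a sketch: the class name TesseractCommand and the method build are made up for illustration, and the -psm flag is spelled as in the Tesseract 3.02 usage text shown above.

```java
import java.util.ArrayList;
import java.util.List;

public class TesseractCommand {

    /**
     * Builds the argument list in the order given by the usage text:
     * tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
     * (hypothetical helper, not part of Tesseract or Tess4J)
     */
    public static List<String> build(String image, String outputBase,
                                     String lang, Integer psm, String... configFiles) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null) {
            cmd.add("-l"); // lowercase letter L, not the digit 1
            cmd.add(lang);
        }
        if (psm != null) {
            if (psm < 0 || psm > 10) {
                throw new IllegalArgumentException("pagesegmode must be 0..10: " + psm);
            }
            cmd.add("-psm");
            cmd.add(String.valueOf(psm));
        }
        // Config files must come after -l and -psm
        for (String configFile : configFiles) {
            cmd.add(configFile);
        }
        return cmd;
    }

    public static void main(String[] args) {
        // prints: tesseract C:/pic/1.png C:/pic/1 -l chi_sim
        System.out.println(String.join(" ", build("C:/pic/1.png", "C:/pic/1", "chi_sim", null)));
        // A single text line read with the digits config file edited in step 4
        System.out.println(String.join(" ", build("C:/pic/id.png", "C:/pic/id", null, 7, "digits")));
    }
}
```

The returned list can be handed directly to a ProcessBuilder, which avoids quoting problems with paths that contain spaces.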
Import dependencies in Java projects
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.2.1</version>
    <exclusions>
        <exclusion>
            <groupId>com.sun.jna</groupId>
            <artifactId>jna</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>4.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/javax.media/jai_imageio -->
<dependency>
    <groupId>javax.media</groupId>
    <artifactId>jai_imageio</artifactId>
    <version>1.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.swinglabs/swingx -->
<dependency>
    <groupId>org.swinglabs</groupId>
    <artifactId>swingx</artifactId>
    <version>1.6.1</version>
</dependency>
Code: ImageIOHelper.class
package com.example.excel.excelexportodb.Utils;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.Locale;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import com.github.jaiimageio.plugins.tiff.TIFFImageWriteParam;
//import com.sun.media.imageio.plugins.tiff.TIFFImageWriteParam;

/**
 * Creates a temporary image file so that the original file is not damaged.
 */
public class ImageIOHelper {
    // Language setting
    private Locale locale = Locale.CHINESE;

    // Constructor with a custom language
    public ImageIOHelper(Locale locale) {
        this.locale = locale;
    }

    // Default constructor; uses Locale.CHINESE
    public ImageIOHelper() {}

    /**
     * Creates a temporary TIFF copy of the image so that the original file is not damaged.
     * @param imageFile the source image
     * @param imageFormat the source format, such as png or jpg
     * @return the temporary image file
     */
    public File createImage(File imageFile, String imageFormat) throws IOException {
        // Get a reader for the source image format
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(imageFormat);
        ImageReader reader = readers.next();
        // Open the file stream
        ImageInputStream iis = ImageIO.createImageInputStream(imageFile);
        reader.setInput(iis);
        IIOMetadata streamMetadata = reader.getStreamMetadata();
        // Set up the write parameters; compression is disabled
        TIFFImageWriteParam tiffWriteParam = new TIFFImageWriteParam(locale);
        tiffWriteParam.setCompressionMode(ImageWriteParam.MODE_DISABLED);
        // Get the TIFF writer and set its output
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        ImageWriter writer = writers.next();
        BufferedImage bi = reader.read(0);
        IIOImage image = new IIOImage(bi, null, reader.getImageMetadata(0));
        File tempFile = tempImageFile(imageFile);
        ImageOutputStream ios = ImageIO.createImageOutputStream(tempFile);
        writer.setOutput(ios);
        writer.write(streamMetadata, image, tiffWriteParam);
        ios.close();
        iis.close();
        writer.dispose();
        reader.dispose();
        return tempFile;
    }

    /**
     * Builds the temporary file name: inserts "_text_recognize_temp" before the
     * extension and replaces the extension with "tif".
     * @param imageFile the source image
     * @throws IOException if the attrib command cannot be started
     */
    private File tempImageFile(File imageFile) throws IOException {
        String path = imageFile.getPath();
        StringBuffer strB = new StringBuffer(path);
        strB.insert(path.lastIndexOf('.'), "_text_recognize_temp");
        // Replace the extension after the last dot with "tif"
        String s = strB.toString().replaceFirst("(?<=\\.)(\\w+)$", "tif");
        // Mark the file as hidden (attrib is a Windows command)
        Runtime.getRuntime().exec("attrib " + "\"" + s + "\"" + " +H");
        return new File(s);
    }
}
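The temporary file name is derived from the original path with a lookbehind regular expression. Here is a standalone sketch of just that step, so it can be tried without any image or OCR setup; the class name TempImageName and the method tempTiffName are hypothetical and exist only for this illustration.

```java
public class TempImageName {

    /**
     * Inserts "_text_recognize_temp" before the extension and swaps the
     * extension for "tif" (hypothetical helper for illustration).
     */
    public static String tempTiffName(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("path has no extension: " + path);
        }
        StringBuilder sb = new StringBuilder(path);
        sb.insert(dot, "_text_recognize_temp");
        // (?<=\.) requires a dot right before the match; (\w+)$ is the extension at the end
        return sb.toString().replaceFirst("(?<=\\.)(\\w+)$", "tif");
    }

    public static void main(String[] args) {
        // prints: D:/pic/1_text_recognize_temp.tif
        System.out.println(tempTiffName("D:/pic/1.png"));
    }
}
```

Because the lookbehind only checks for the dot without consuming it, the dot stays in place and only the extension is replaced.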
OCRUtil.class
package com.example.excel.excelexportodb.Utils;

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.jdesktop.swingx.util.OS;

/**
 * OCR utility class.
 */
public class OCRUtil {
    // The option is the letter "l", not the digit "1"
    private final String LANG_OPTION = "-l";
    private final String EOL = System.getProperty("line.separator");
    /**
     * Tesseract OCR installation path; adjust it to your own installation directory
     */
    private String tessPath = "D://tess4j//tess4j-SRC//tesseract-OCR-set-3.0//tesseract-OCR";

    public OCRUtil(String tessPath, String transFileName) {
        this.tessPath = tessPath;
    }

    // Default constructor: uses the tessPath above
    // (a standard installation is typically under "C://Program Files (x86)//Tesseract-OCR")
    public OCRUtil() {}

    public String getTessPath() {
        return tessPath;
    }

    public void setTessPath(String tessPath) {
        this.tessPath = tessPath;
    }

    public String getLANG_OPTION() {
        return LANG_OPTION;
    }

    public String getEOL() {
        return EOL;
    }

    /**
     * @return the recognized text
     */
    public String recognizeText(File imageFile, String imageFormat) throws Exception {
        File tempImage = new ImageIOHelper().createImage(imageFile, imageFormat);
        return ocrImages(tempImage, imageFile);
    }

    /**
     * Same as above, with a custom language
     */
    public String recognizeText(File imageFile, String imageFormat, Locale locale) throws Exception {
        File tempImage = new ImageIOHelper(locale).createImage(imageFile, imageFormat);
        return ocrImages(tempImage, imageFile);
    }

    /**
     * @param tempImage the temporary image passed to Tesseract
     * @param imageFile the original image; its directory is used as the working directory
     * @return the recognized content
     * @throws IOException
     * @throws InterruptedException
     */
    private String ocrImages(File tempImage, File imageFile) throws IOException, InterruptedException {
        // Output directory and base name (Tesseract appends ".txt")
        File outputFile = new File(imageFile.getParentFile(), "test");
        StringBuffer strB = new StringBuffer();
        // Build the command line
        List<String> cmd = new ArrayList<String>();
        if (OS.isWindowsXP()) {
            cmd.add(tessPath + "//tesseract");
        } else if (OS.isLinux()) {
            cmd.add("tesseract");
        } else {
            cmd.add(tessPath + "//tesseract");
        }
        cmd.add(""); // placeholder for the image name, set below
        cmd.add(outputFile.getName());
        cmd.add(LANG_OPTION);
        // Chinese, equation and English language data, joined with "+" into a single argument
        cmd.add("chi_sim+equ+eng");
        // Create the operating system process builder
        ProcessBuilder pb = new ProcessBuilder();
        // Set the working directory of the process
        pb.directory(imageFile.getParentFile());
        cmd.set(1, tempImage.getName());
        // Set the command to execute
        pb.command(cmd);
        // Merge the error output of the child process into its standard output
        pb.redirectErrorStream(true);
        long startTime = System.currentTimeMillis();
        System.out.println("Start time:" + startTime);
        // Start execution and return the process instance
        Process process = pb.start();
        // Equivalent command: tesseract <temp image> test -l chi_sim+equ+eng
        // Optionally print the process output:
        // printMessage(process.getInputStream());
        // printMessage(process.getErrorStream());
        int w = process.waitFor();
        // Delete the temporary working file
        tempImage.delete();
        if (w == 0) { // 0 indicates normal exit
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(outputFile.getAbsolutePath() + ".txt"), "UTF-8"));
            String str;
            while ((str = in.readLine()) != null) {
                strB.append(str).append(EOL);
            }
            in.close();
            long endTime = System.currentTimeMillis();
            System.out.println("End time:" + endTime);
            System.out.println("Time:" + (endTime - startTime) + " ms");
        } else {
            String msg;
            switch (w) {
                case 1:
                    msg = "Errors accessing files. There may be spaces in your image's filename.";
                    break;
                case 29:
                    msg = "Cannot recognize the image or its selected region.";
                    break;
                case 31:
                    msg = "Unsupported image format.";
                    break;
                default:
                    msg = "Errors occurred.";
            }
            tempImage.delete();
            throw new RuntimeException(msg);
        }
        // Delete the temporary text file that holds the extracted text
        new File(outputFile.getAbsolutePath() + ".txt").delete();
        // Strip all whitespace from the result
        return strB.toString().replaceAll("\\s+", "");
    }

    // public static void main(String[] args) throws IOException, InterruptedException {
    //     String cmd = "cmd /c dir c:\\windows";
    //     final Process process = Runtime.getRuntime().exec(cmd);
    //     printMessage(process.getInputStream());
    //     printMessage(process.getErrorStream());
    //     int value = process.waitFor();
    //     System.out.println(value);
    // }

    // Prints a stream line by line on a separate thread
    private static void printMessage(final InputStream input) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                Reader reader = new InputStreamReader(input);
                BufferedReader bf = new BufferedReader(reader);
                String line = null;
                try {
                    while ((line = bf.readLine()) != null) {
                        System.out.println(line);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}
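OCRUtil starts Tesseract as a child process and waits for it without reading what it prints. If a child process writes more than the pipe buffer can hold, waitFor() can block, so one common approach is to read the merged output to the end before waiting. The sketch below shows this with the standard ProcessBuilder API only; the class ProcessRunner and its Result type are hypothetical names, and the demo runs java -version instead of tesseract so that it works without an OCR installation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

public class ProcessRunner {

    /** Exit code plus everything the child wrote to stdout and stderr. */
    public static final class Result {
        public final int exitCode;
        public final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs the command, drains its merged output, then waits for it to exit. */
    public static Result run(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout so one reader is enough
        try {
            Process process = pb.start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                // Reading until end of stream keeps the child from blocking on a full pipe
                while ((line = reader.readLine()) != null) {
                    out.append(line).append(System.lineSeparator());
                }
            }
            return new Result(process.waitFor(), out.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the process", e);
        }
    }

    public static void main(String[] args) {
        // Demo command: the java launcher of the running JVM
        String java = System.getProperty("java.home") + "/bin/java";
        Result result = run(Arrays.asList(java, "-version"));
        System.out.println("exit code: " + result.exitCode);
        System.out.print(result.output);
    }
}
```

Keeping the captured output also makes failures easier to diagnose, because the message Tesseract printed can be included in the exception instead of a fixed text per exit code.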
The main function
public static void main(String[] args) throws TesseractException {
    String path = "D://1.png";
    System.out.println("ORC Test Begin......");
    try {
        String valCode = new OCRUtil().recognizeText(new File(path), "png");
        System.out.println(valCode);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("ORC Test End......");
}
Identify the image
ORC Test Begin……
To test the 123
ORC Test End……
The overall recognition quality is only fair: some complex characters come out garbled. To improve it, we need to train the language data.
The more you train, the more accurate the recognition becomes.
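Since recognition is not fully reliable, it can help to validate the numbers pulled out of the recognized text before trusting them. The sketch below is one possible post-processing step, not part of the project above: it looks for 18-character ID card numbers (17 digits plus a check character that is a digit or X) and keeps only those whose check character matches the standard weighted checksum. The class name IdNumberExtractor and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdNumberExtractor {

    // 17 digits followed by a digit or X, not embedded in a longer run of such characters
    private static final Pattern CANDIDATE =
            Pattern.compile("(?<![0-9])[0-9]{17}[0-9Xx](?![0-9Xx])");
    // Weights for the first 17 digits and the check characters indexed by (sum mod 11)
    private static final int[] WEIGHTS = {7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2};
    private static final String CHECK_CHARS = "10X98765432";

    /** Returns true if the 18th character matches the checksum of the first 17 digits. */
    public static boolean hasValidChecksum(String id) {
        if (id == null || id.length() != 18) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 17; i++) {
            char c = id.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
            sum += (c - '0') * WEIGHTS[i];
        }
        return CHECK_CHARS.charAt(sum % 11) == Character.toUpperCase(id.charAt(17));
    }

    /** Finds all candidates in the OCR text and keeps those with a valid checksum. */
    public static List<String> extract(String ocrText) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(ocrText);
        while (matcher.find()) {
            String candidate = matcher.group().toUpperCase();
            if (hasValidChecksum(candidate)) {
                ids.add(candidate);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sample text with a well-known example number; prints [11010519491231002X]
        System.out.println(extract("No.11010519491231002Xabc"));
    }
}
```

A number that fails the checksum is a strong hint that at least one character was misread, so such images can be routed to manual review instead of being accepted automatically.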