Abstract

This article shows how to extract text from images with Java and OCR. On an e-commerce platform, merchants' business licenses and ID cards often need to be audited, and the numbers on those certificates need to be extracted. When there are only a few merchants, this can be done by hand; as the number of merchants grows, manual auditing becomes labor-intensive. A natural solution is to extract the text from the images automatically. Let's take a look at OCR image text extraction.

mall4j, a hands-on Spring Boot e-commerce project: gitee.com/gz-yami/mal…

A first look at image text extraction with Tesseract OCR

One option is to call a third-party interface for image text extraction, such as the Baidu API.

Here we do the extraction locally instead, using Java + Tesseract.

Environment: Windows 10, JDK 1.8, IDEA, Maven

Before installing Tesseract 3.02, make sure a Java environment (JDK) is already in place.

1. Download and install Tesseract 3.02

2. Download chi_sim.traineddata

Baidu Cloud link: pan.baidu.com/s/1VYx6zbKA… Extraction code: HSR7

3. Place chi_sim.traineddata in the tessdata directory

4. Open the digits file under \tessdata\configs and change tessedit_char_whitelist 0123456789-. to tessedit_char_whitelist 0123456789x.

5. Configure the environment variables (add the Tesseract 3.02 installation directory)

6. Check whether the installation succeeded

Press WIN+R and open CMD

Enter tesseract

If the following usage text is printed, the installation is complete

Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile…]

pagesegmode values are:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.

-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:

-v --version: version info

--list-langs: list available languages for tesseract engine
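The argument order in the usage text matters when the command is assembled from Java: -l and -psm have to come before any config file. The following is a minimal sketch of that ordering, not part of the original project. The class TesseractCommand and its build method are hypothetical names introduced here for illustration; only the flags shown in the usage text above are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assembles a Tesseract 3.02 command line in the order the usage text requires. */
class TesseractCommand {

    /**
     * @param image      path of the input image
     * @param outputBase output path without extension (Tesseract appends .txt)
     * @param lang       language code such as chi_sim, or null to use the default
     * @param psm        page segmentation mode 0-10, or -1 to use the default
     * @param configFile optional config file name such as digits, or null
     */
    static List<String> build(String image, String outputBase, String lang, int psm, String configFile) {
        if (psm < -1 || psm > 10) {
            throw new IllegalArgumentException("pagesegmode must be between 0 and 10");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null) {
            cmd.add("-l");
            cmd.add(lang);
        }
        if (psm >= 0) {
            cmd.add("-psm");
            cmd.add(String.valueOf(psm));
        }
        // Config files must come last, after -l and -psm
        if (configFile != null) {
            cmd.add(configFile);
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Treat the image as a single text line (mode 7) and apply the digits config
        List<String> cmd = build("C:/pic/1.png", "C:/pic/1", "chi_sim", 7, "digits");
        System.out.println(String.join(" ", cmd));
    }
}
```

The returned list can be passed directly to ProcessBuilder.command(List), which avoids quoting problems with paths that contain spaces.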

Extracting text

tesseract C:/pic/1.png C:/pic/1 -l chi_sim

This reads the image C:/pic/1.png and generates C:/pic/1.txt, which contains the extracted text.

-l: this is the lowercase letter L, not the digit 1.

chi_sim: the Simplified Chinese language data

Import the dependencies in the Java project

        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>3.2.1</version>
            <exclusions>
                <exclusion>
                    <groupId>com.sun.jna</groupId>
                    <artifactId>jna</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>4.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/javax.media/jai_imageio -->
        <dependency>
            <groupId>javax.media</groupId>
            <artifactId>jai_imageio</artifactId>
            <version>1.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.swinglabs/swingx -->
        <dependency>
            <groupId>org.swinglabs</groupId>
            <artifactId>swingx</artifactId>
            <version>1.6.1</version>
        </dependency>

ImageIOHelper.class

package com.example.excel.excelexportodb.Utils;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.Locale;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;

import com.github.jaiimageio.plugins.tiff.TIFFImageWriteParam;
//import com.sun.media.imageio.plugins.tiff.TIFFImageWriteParam;


/**
 * Class description: creates a temporary image file to prevent damage to the original file
 */
public class ImageIOHelper {

    // Locale setting; defaults to Locale.CHINESE
    private Locale locale = Locale.CHINESE;

    // Constructor with a custom locale
    public ImageIOHelper(Locale locale) {
        this.locale = locale;
    }

    // Default constructor; keeps Locale.CHINESE
    public ImageIOHelper() {
    }

    /**
     * Create a temporary image file to prevent damage to the original file
     * @param imageFile the source image
     * @param imageFormat like png, jpg, etc.
     * @return TempFile of Image
     */
    public File createImage(File imageFile, String imageFormat) throws IOException {

        // Read the image file
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(imageFormat);
        ImageReader reader = readers.next();
        // Get the file stream
        ImageInputStream iis = ImageIO.createImageInputStream(imageFile);
        reader.setInput(iis);
        IIOMetadata streamMetadata = reader.getStreamMetadata();

        // Set writeParam and disable compression
        TIFFImageWriteParam tiffWriteParam = new TIFFImageWriteParam(Locale.CHINESE);
        tiffWriteParam.setCompressionMode(ImageWriteParam.MODE_DISABLED);

        // Get tiffWriter and set output
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        ImageWriter writer = writers.next();


        BufferedImage bi = reader.read(0);
        IIOImage image = new IIOImage(bi,null,reader.getImageMetadata(0));
        File tempFile = tempImageFile(imageFile);
        ImageOutputStream ios = ImageIO.createImageOutputStream(tempFile);

        writer.setOutput(ios);
        writer.write(streamMetadata, image, tiffWriteParam);

        ios.close();
        iis.close();
        writer.dispose();
        reader.dispose();

        return tempFile;
    }

    /**
     * Build the temporary file by adding a suffix to the original file name
     * @param imageFile the source image
     * @throws IOException
     */
    private File tempImageFile(File imageFile) throws IOException {
        String path = imageFile.getPath();
        StringBuffer strB = new StringBuffer(path);
        strB.insert(path.lastIndexOf('.'), "_text_recognize_temp");
        String s = strB.toString().replaceFirst("(?<=\\.)(\\w+)$", "tif");
        // Mark the file as hidden (attrib is a Windows-only command)
        Runtime.getRuntime().exec("attrib " + "\"" + s + "\"" + " +H");
        return new File(strB.toString());
    }
}
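To make the naming in tempImageFile easier to follow, the string steps can be tried in isolation. This is a small sketch, assuming the regex in that method is meant to be (?<=\.)(\w+)$; the class TempNameDemo and its method names are hypothetical and exist only for this illustration.

```java
/** Hypothetical demo of the file-name steps used by tempImageFile. */
class TempNameDemo {

    /** Insert the marker suffix just before the last dot of the path. */
    static String withSuffix(String path) {
        StringBuilder sb = new StringBuilder(path);
        sb.insert(path.lastIndexOf('.'), "_text_recognize_temp");
        return sb.toString();
    }

    /** Replace the trailing extension (the word characters after the last dot) with tif. */
    static String withTifExtension(String path) {
        return path.replaceFirst("(?<=\\.)(\\w+)$", "tif");
    }

    public static void main(String[] args) {
        String suffixed = withSuffix("D:/pic/1.png");
        System.out.println(suffixed);                   // D:/pic/1_text_recognize_temp.png
        System.out.println(withTifExtension(suffixed)); // D:/pic/1_text_recognize_temp.tif
    }
}
```

Note that a path without any dot makes lastIndexOf return -1, so the insert call throws StringIndexOutOfBoundsException; callers would need to check for an extension first.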

OCRUtil.class

package com.example.excel.excelexportodb.Utils;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.jdesktop.swingx.util.OS;

/**
 * Class description: OCR tool class
 */
public class OCRUtil {
    // The option is the lowercase letter "l", not the digit "1"
    private final String LANG_OPTION = "-l";

    private final String EOL = System.getProperty("line.separator");

    /**
     * Tesseract OCR installation path (machine-specific; replace it with your own installation directory)
     */
    private String tessPath = "D://tess4j//tess4j-SRC//tesseract-OCR-set-3.0//tesseract-OCR";



    public OCRUtil(String tessPath,String transFileName){
        this.tessPath=tessPath;
    }

    // Default constructor; uses the tessPath value declared above

    public OCRUtil() {
    }

    public String getTessPath() {
        return tessPath;
    }
    public void setTessPath(String tessPath) {
        this.tessPath = tessPath;
    }
    public String getLANG_OPTION() {
        return LANG_OPTION;
    }
    public String getEOL() {
        return EOL;
    }


    /**
     * @return The recognized text
     */
    public String recognizeText(File imageFile,String imageFormat)throws Exception{
        File tempImage = new ImageIOHelper().createImage(imageFile,imageFormat);
        return ocrImages(tempImage, imageFile);
    }

    /** * you can customize the language */
    public String recognizeText(File imageFile,String imageFormat,Locale locale)throws Exception{
        File tempImage = new ImageIOHelper(locale).createImage(imageFile,imageFormat);
        return ocrImages(tempImage, imageFile);
    }

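    /**
     * @param tempImage the temporary image that is passed to Tesseract
     * @param imageFile the original image; its directory is used as the working directory
     * @return Identified content
     * @throws IOException
     * @throws InterruptedException
     */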
    private String ocrImages(File tempImage,File imageFile) throws IOException, InterruptedException{

        // Set the output file directory and file name
        File outputFile = new File(imageFile.getParentFile(),"test");
        StringBuffer strB = new StringBuffer();


        // Set the command line content
        List<String> cmd = new ArrayList<String>();
        if(OS.isWindowsXP()){
            cmd.add(tessPath+"//tesseract");
        }else if(OS.isLinux()){
            cmd.add("tesseract");
        }else{
            cmd.add(tessPath+"//tesseract");
        }
        cmd.add("");
        cmd.add(outputFile.getName());
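        cmd.add(LANG_OPTION);
        // Languages are joined with '+' into a single argument:
        // chi_sim (Simplified Chinese), equ (math/equation data), eng (English).
        // Each one needs its traineddata file in the tessdata directory.
        cmd.add("chi_sim+equ+eng");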


        // Create an operating system process
        ProcessBuilder pb = new ProcessBuilder();
        // Sets the working directory for this process generator
        pb.directory(imageFile.getParentFile());
        cmd.set(1, tempImage.getName());
        // Sets the CMD command to execute
        pb.command(cmd);
        // Set error output generated by subsequent child processes to be merged with standard output
        pb.redirectErrorStream(true);

        long startTime = System.currentTimeMillis();
        System.out.println("Start time:" + startTime);
        // Start execution and return the process instance
        Process process = pb.start();
        // Equivalent command: tesseract <temp image> test -l chi_sim+equ+eng
        // Input/output stream optimization
// printMessage(process.getInputStream());
// printMessage(process.getErrorStream());
        int w = process.waitFor();
        // Delete temporary working files
        tempImage.delete();
        if(w==0) {// 0 indicates normal exit
            BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(outputFile.getAbsolutePath()+".txt"),"UTF-8"));
            String str;
            while ((str = in.readLine()) != null) {
                strB.append(str).append(EOL);
            }
            in.close();

            long endTime = System.currentTimeMillis();
            System.out.println("End time:" + endTime);
            System.out.println("Time:" + (endTime - startTime) + " ms");

        }else{
            String msg;
            switch(w){
                case 1:
                    msg = "Errors accessing files. There may be spaces in your image's filename.";
                    break;
                case 29:
                    msg = "Cannot recognize the image or its selected region.";
                    break;
                case 31:
                    msg = "Unsupported image format.";
                    break;
                default:
                    msg = "Errors occurred.";
            }
            tempImage.delete();
            throw new RuntimeException(msg);
        }
        // Delete the temporary file from which the text is extracted
        new File(outputFile.getAbsolutePath()+".txt").delete();
        return strB.toString().replaceAll("\\s*", "");
    }

// public static void main(String[] args) throws IOException, InterruptedException {
// String cmd = "cmd /c dir c:\\windows";
// final Process process = Runtime.getRuntime().exec(cmd);
// printMessage(process.getInputStream());
// printMessage(process.getErrorStream());
// int value = process.waitFor();
// System.out.println(value);
// }
    // Print a stream's content line by line on a separate thread
    private static void printMessage(final InputStream input) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                Reader reader = new InputStreamReader(input);
                BufferedReader bf = new BufferedReader(reader);
                String line = null;
                try {
                    while ((line = bf.readLine()) != null) {
                        System.out.println(line);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}
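In ocrImages the calls to printMessage are commented out, so the child process's output is never read. If a child process writes more output than the pipe buffer holds, waitFor() can block. One possible approach, sketched here under the assumption that the output is small enough to keep in memory, is to read the merged output stream completely before calling waitFor(). The class ProcessOutput and the method readAll are hypothetical names, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads everything a process prints so it cannot stall on a full pipe. */
class ProcessOutput {

    /** Read the stream to its end and return the content, one '\n' after each line. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for process.getInputStream(); no real process is started here
        InputStream fake = new ByteArrayInputStream("line one\nline two".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake));
    }
}
```

With pb.redirectErrorStream(true) already set, a call such as readAll(process.getInputStream()) placed before process.waitFor() would collect both standard output and error output in one string, which can also be attached to the exception message when the exit code is non-zero.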

The main function

 public static void main(String[] args) throws TesseractException {
        String path = "D://1.png";
        System.out.println("ORC Test Begin......");
        try {
            String valCode = new OCRUtil().recognizeText(new File(path), "png");
            System.out.println(valCode);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("ORC Test End......");


    }

Recognition result

ORC Test Begin......

Test 123

ORC Test End......

The recognition quality is only average: some complex characters come out garbled. To improve it, we need to train the language data.

The more training data it gets, the more accurate the recognition becomes.
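Since the goal stated at the start is to pull the numbers out of certificates, the recognized text still needs a post-processing step. The sketch below shows one way to do that, assuming an 18-character ID number made of 17 digits followed by a digit or X, and assuming the whitespace has already been stripped as recognizeText does. The class CertificateNumbers, the method findIdNumbers, and the sample number are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical post-processing step: finds 18-character ID numbers in OCR output. */
class CertificateNumbers {

    // 17 digits plus a digit or X, not embedded in a longer run of digits
    private static final Pattern ID_NUMBER =
            Pattern.compile("(?<!\\d)\\d{17}[\\dXx](?![\\dXx])");

    /** Return every ID-like number found in the text, with a trailing x normalized to X. */
    static List<String> findIdNumbers(String ocrText) {
        List<String> found = new ArrayList<>();
        Matcher m = ID_NUMBER.matcher(ocrText);
        while (m.find()) {
            found.add(m.group().toUpperCase());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text in the whitespace-free form that recognizeText returns
        String text = "Name:ZhangSanIDnumber:11010519491231002xAddress:Beijing";
        System.out.println(findIdNumbers(text));
    }
}
```

A pattern match only confirms the shape of the number. Because OCR can misread individual digits, the extracted value is best treated as a candidate for a reviewer or a checksum validation rather than as a verified result.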
