Background

Recently, a project at my company needed OCR to recognize text in images. After looking into tesseract.js, we tested its performance and found that recognition took about 4 s, which met our expectations. Because this is an internal bank project with no access to the external network, it also had to work offline. After finishing, I found that there are few articles in China about tesseract.js, especially about the offline version, so I wrote this one. If you only want the source code, jump straight to the end of the article.

First try: tesseract.js-offline

This is the official offline version, which comes in a browser version and a Node version. I am not very familiar with the Node offline version, and since Vue runs in the browser, I used the browser version as the reference. The key code is as follows:

<!-- index.html -->
<script>
  const { createWorker } = Tesseract;

  // Point every resource at a local copy so nothing is fetched from the network.
  const worker = createWorker({
    workerPath: '../node_modules/tesseract.js/dist/worker.min.js',
    langPath: '../lang-data',
    corePath: '../node_modules/tesseract.js-core/tesseract-core.wasm.js',
    logger: m => console.log(m),
  });

  (async () => {
    await worker.load();
    await worker.loadLanguage('eng');
    await worker.initialize('eng');
    const { data: { text } } = await worker.recognize('../images/testocr.png');
    console.log(text);
    await worker.terminate();
  })();
</script>
Analysis

The offline version differs from the regular one in that the workerPath, langPath, and corePath options passed to createWorker point to files downloaded into the local project. However, a Vue application using the code above will not find these paths, and after npm run build the production output has no node_modules at all.

The key to the solution

The key is getting the paths right. The solution is as follows:

  1. Install tesseract.js and tesseract.js-core: npm i -S tesseract.js tesseract.js-core
  2. Go to node_modules, find the two dependencies above, and copy them into the Vue project's public folder (the paths in step 5 assume they sit under public/tesseract)
  3. Next, sort out the language pack. Open the official Tesseract.js language pack page and find the language pack you need. I needed Simplified Chinese: chi_sim.traineddata.gz (tested: the Simplified Chinese language pack also recognizes the 26 English letters)
  4. Place the language pack in the public directory in the same way (public/tesseract/lang-data for the paths in step 5)
  5. Modify the paths in the code, as follows
const worker = createWorker({
  workerPath: "/tesseract/tesseract.js/dist/worker.min.js",
  langPath: "/tesseract/lang-data",
  corePath: "/tesseract/tesseract.js-core/tesseract-core.wasm.js",
  logger: (m) => console.log(m),
});
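The absolute paths above assume the app is served from the site root. If it is deployed under a sub-path, the paths have to follow it. Below is a minimal sketch of one way to do that; tesseractPaths is a hypothetical helper (not part of tesseract.js), and '/ocr-app' is an example value only. In a Vue CLI project, process.env.BASE_URL mirrors the publicPath option and could be passed in as the base.

```javascript
// Hypothetical helper (not part of tesseract.js): builds the three local
// resource paths from a single base URL so they follow the deployment path.
function tesseractPaths(baseUrl) {
  // Normalize the base so it ends with exactly one slash.
  const base = baseUrl.replace(/\/*$/, '/');
  return {
    workerPath: `${base}tesseract/tesseract.js/dist/worker.min.js`,
    langPath: `${base}tesseract/lang-data`,
    corePath: `${base}tesseract/tesseract.js-core/tesseract-core.wasm.js`,
  };
}

// Possible usage inside a Vue CLI project (assumption: publicPath is set):
// const worker = createWorker({
//   ...tesseractPaths(process.env.BASE_URL),
//   logger: (m) => console.log(m),
// });
console.log(tesseractPaths('/ocr-app'));
```

With a base of '/', the helper yields the same three paths as the configuration above, so it can replace the hard-coded strings without changing behavior for root deployments.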

Other

Resource CDN optimization

The tesseract.js, lang-data, and tesseract.js-core files in the public directory are all over 10 MB, which slows down npm run build and later deployment to the server. If you are in a position to do so, it is recommended to host them on a CDN.
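As a rough sketch of what this could look like, the worker options can be switched between the local copies and a CDN. The CDN origin below is a made-up placeholder, ocrWorkerOptions is a hypothetical helper, and the directory layout on the CDN is assumed to mirror the one under public/tesseract.

```javascript
// Placeholder values for illustration only; replace with your own CDN origin.
const CDN_BASE = 'https://cdn.example.com/tesseract';
const LOCAL_BASE = '/tesseract';

// Hypothetical helper: returns createWorker path options that point either
// at the CDN or at the local copies, assuming both share the same layout.
function ocrWorkerOptions(useCdn) {
  const base = useCdn ? CDN_BASE : LOCAL_BASE;
  return {
    workerPath: `${base}/tesseract.js/dist/worker.min.js`,
    langPath: `${base}/lang-data`,
    corePath: `${base}/tesseract.js-core/tesseract-core.wasm.js`,
  };
}

// Possible usage: CDN in production builds, local files during development.
// const worker = createWorker(ocrWorkerOptions(process.env.NODE_ENV === 'production'));
console.log(ocrWorkerOptions(true).langPath);
```

Once the files are served from the CDN, they would also need to be removed from public, otherwise they are still copied into the build output. The CDN would likely need to allow cross-origin requests from the app's origin, since the language data is fetched at runtime.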

GitHub

Github.com/q27488/tess…