Recognizing text in images with Python

One, Foreword

You may have run into this problem: you find an article you really like in an application or on a web page, but you cannot copy it. Or, as with Baidu documents, you can only copy part of it, so you save a screenshot instead. But when you want to use the words in it, you still have to type them out one by one. So can we recognize the text in the picture directly? The answer is yes.

Two, Tesseract

Text recognition is part of OCR, which stands for optical character recognition. Tesseract is an OCR tool that we can combine with Python to implement text recognition quickly. But before we can do that, we need to finish a tedious setup task.

(1) Tesseract installation and configuration

To install Tesseract, go to https://digi.bib.uni-mannheim.de/tesseract/, where you will see the following page:

Many versions are available, so choose according to your own needs. W32 denotes the 32-bit build and W64 denotes the 64-bit build. If the download is slow, you can use this link instead: pan.baidu.com/s/1jKZe_ACL… (extraction code: Ayel). During installation, take note of the installation directory, because it has to be added to the system Path variable afterwards. Our path is D:\CodeField\tesseract-ocr.

Right-click My Computer/This PC -> Properties -> Advanced system settings -> Environment Variables -> Path -> Edit -> New, and paste our path into the new entry. After adding it, click OK so that the configuration takes effect.
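Whether the Path entry took effect can be checked from Python. The following is only a minimal sketch and not part of the original setup: find_tesseract is a hypothetical helper name, the executable is assumed to be called tesseract (tesseract.exe on Windows), and the fallback directory is the installation path used in this article.

```python
import os
import shutil


def find_tesseract(install_dir=None):
    """Return the path of the tesseract executable, or None if it is not found."""
    # shutil.which searches the directories listed in the PATH variable.
    found = shutil.which("tesseract")
    if found is None and install_dir is not None:
        # Fallback (assumption): look for tesseract.exe in the install directory.
        candidate = os.path.join(install_dir, "tesseract.exe")
        if os.path.isfile(candidate):
            found = candidate
    return found


exe = find_tesseract(r"D:\CodeField\tesseract-ocr")
print(exe if exe else "tesseract not found; check the Path variable")
```

If the executable is found only through the fallback directory, pytesseract can be pointed at it by assigning the path to pytesseract.pytesseract.tesseract_cmd.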

(2) Download the language pack

Tesseract does not support Chinese by default. To recognize Chinese or other languages, you need to download the corresponding language pack. Go to https://tesseract-ocr.github.io/tessdoc/Data-Files and scroll down the page:

There are two Chinese language packs, Chinese - Simplified and Chinese - Traditional; download whichever you need. After the download, put the file in the tessdata directory under the Tesseract path. Our path is D:\CodeField\tesseract-ocr\tessdata.
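To confirm that a language pack landed in the right place, the directory can be listed from Python. This is a minimal sketch under the assumption that each pack is a file named after its language code with a .traineddata extension (for example chi_sim.traineddata); installed_languages is a hypothetical helper name.

```python
import os


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


tessdata = r"D:\CodeField\tesseract-ocr\tessdata"  # path used in this article
if os.path.isdir(tessdata):
    print(installed_languages(tessdata))
else:
    print("tessdata directory not found:", tessdata)
```

If chi_sim is missing from the printed list, recognition with lang='chi_sim' cannot work.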

(3) Download other modules

In addition to the steps above, we need to install two Python modules:

pip install pytesseract
pip install pillow

The first is for text recognition and the second is for reading images. With both installed, we can start recognizing text.
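A quick way to check that both modules are visible to the interpreter is sketched below. It relies only on the standard library; missing_modules is a hypothetical helper name, and note that pillow is imported under the name PIL.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both modules are installed.
print(missing_modules(["pytesseract", "PIL"]))
```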

Three, Text recognition

(1) Single picture recognition

The rest is much simpler. Here is the image we want to recognize:

Here is our code for text recognition:

import pytesseract
from PIL import Image
# Read image
im = Image.open('sentence.jpg')
# Identify text
string = pytesseract.image_to_string(im)
print(string)

The recognition result is as follows:

Do not go gentle into that good night!

Since English is supported by default, it is recognized directly. To recognize Chinese or another language, we need to make a small change:

import pytesseract
from PIL import Image
# Read image
im = Image.open('sentence.png')
# Identify text and specify language
string = pytesseract.image_to_string(im, lang='chi_sim')
print(string)

Here we set lang='chi_sim', which sets the language to Simplified Chinese. This setting only works if the Simplified Chinese pack is in your tessdata directory. Here is the image we used:

The recognition result is as follows:

Do not go gently into that good night

The image content was recognized correctly. Note that Tesseract still recognizes English characters even when the language is set to Simplified Chinese or another language.
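Tesseract also accepts several language codes joined with +, which makes mixed Chinese and English text explicit. The following is a minimal sketch, assuming both the chi_sim and eng packs are present in tessdata; build_lang and recognize are hypothetical helper names, not part of pytesseract.

```python
def build_lang(codes):
    """Join Tesseract language codes into one lang string, e.g. 'chi_sim+eng'."""
    return "+".join(codes)


def recognize(image_path, codes=("chi_sim", "eng")):
    """Recognize text in one image using every language in `codes`."""
    # Imported here so build_lang can be used without the OCR libraries installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as im:
        return pytesseract.image_to_string(im, lang=build_lang(codes))
```

For example, print(recognize('sentence.png')) would run recognition with both packs on the image used above.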

(2) Batch picture recognition

Having covered single-image recognition, we can move on to batch recognition. This requires a TXT file that lists the images. For example, I have a text.txt file with the following contents:

sentence1.jpg
sentence2.jpg

We modify the code as follows:

import pytesseract
# Identify text
string = pytesseract.image_to_string('text.txt', lang='chi_sim')
print(string)

Writing the TXT file by hand is tedious, though, so we can generate it in code instead:

import os
import pytesseract
# Directory containing the images to recognize
path = 'text_img/'
# Build the list of image paths
imgs = [os.path.join(path, name) for name in os.listdir(path)]
# Write one image path per line to text.txt
with open('text.txt', 'w', encoding='utf-8') as f:
    for img in imgs:
        f.write(img + '\n')
# Recognize every image listed in text.txt
string = pytesseract.image_to_string('text.txt', lang='chi_sim')
print(string)
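The directory listing above picks up every file in the folder, including any that are not images. One possible refinement is to filter by file extension first. This is only a sketch: list_images is a hypothetical helper name, and the extension list is an assumption to adjust as needed.

```python
import os

# Extensions treated as images here; an assumption to adjust as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")


def list_images(directory):
    """Return sorted paths of files in `directory` with an image extension."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
```

The returned list can replace imgs before the paths are written to text.txt.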

This way we only need to pass the root directory of the images to recognize them in batch. In testing, Tesseract turned out to be inaccurate for handwriting, xingkai (semi-cursive script) and other decorative fonts, and its recognition of some complex characters also needs improvement. Accuracy is very high, however, for print fonts such as Song. In addition, if the image is tilted beyond a certain angle, the recognition results become much less accurate.

For more content, follow the WeChat official account: New Folder X.