The beginning of the story
I went to the finance department today to check last month's pay slip and found the goddess who works there looking miserable, as if the sky were falling. After checking my slip I asked her: "Why so unhappy? Payday is almost here!" She said: "The boss just gave me a task: put this month's invoices into an Excel sheet and hand it to him by the end of the day. There are so many invoices that I won't finish sorting them until tomorrow. I won't be getting off work today!" I said: "Then I'll get it done for you in ten minutes, and after work you can buy me a big dinner." After all, chances like this to get closer don't come often. Of course she looked at me in disbelief. So I decided to let my technology win her over!
The main text
Here we take four invoices as an example and place the invoice pictures in the pic folder.
Open one of the invoices at random:
These are invoices I found online; I certainly won't use the company's real invoices for a tutorial! Otherwise the accounting girl and I would both be packing our bags tomorrow, and she would hate me for it! Ha ha ha.
Extraction target: amount, name, taxpayer identification number, drawer.
Finally, save these four fields of each invoice to Excel:
Libraries you need
The required libraries are as follows (io and os are used later to read the pictures and list the folder):
import io
import os
from PIL import Image as PI
import pyocr
import pyocr.builders
from cnocr import CnOcr
The installation command is as follows:
pip install pyocr
pip install cnocr
Installation is very simple!
The invoices contain Chinese text, so we need to recognize Chinese characters in the pictures; CnOcr is a good choice for that.
Note: besides the libraries above, you also need to install two external programs, otherwise the code will fail with an error.
Programs to install:
1. ImageMagick
2. tesseract-OCR
Installing these two programs is not covered here; you can search for a tutorial.
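Before running the OCR code, it can help to confirm that the external programs are actually reachable. The following is only a minimal sketch: it assumes the Tesseract executable is called `tesseract` and sits on your PATH, and the helper name `missing_tools` is made up for illustration.

```python
import shutil


def missing_tools(names=("tesseract",), which=shutil.which):
    """Return the executables from `names` that cannot be found on PATH.

    `which` is injectable so the check can be tried without the tools installed.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If Tesseract is missing, `pyocr.get_available_tools()` returns an empty list, so indexing it with `[0]` raises an IndexError.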
03. Extract content
Here, take one of the pictures as an example to explain how to extract the target content: amount, name, taxpayer identification number and drawer.
Read the picture: pic/pic1.jpg
tool = pyocr.get_available_tools()[0]
img_url = "pic/pic1.jpg"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))
Extract the amount
First crop the picture to the region that contains the invoice amount.
image_text1 = new_img.crop((left, top, right, bottom))
image_text1.show()
The left, top, right, and bottom values were found by trial and error over several adjustments. Choose them according to the layout of your own invoices.
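A crop box tuned on one picture only carries over to invoices with the same layout and pixel size. One possible way to make it less brittle is to store the box as fractions of the image size and convert back per picture. This is a sketch under that assumption; the helper names and the 1000x700 reference size are hypothetical.

```python
def to_relative(box, size):
    """Convert a pixel box (left, top, right, bottom) to fractions of (width, height)."""
    left, top, right, bottom = box
    width, height = size
    return (left / width, top / height, right / width, bottom / height)


def to_pixels(rel_box, size):
    """Convert a fractional box back to integer pixel coordinates for `size`."""
    left, top, right, bottom = rel_box
    width, height = size
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Hypothetical example: a box measured on a 1000x700 scan, reused on a 2000x1400 scan.
rel_box = to_relative((155, 450, 450, 470), (1000, 700))
scaled_box = to_pixels(rel_box, (2000, 1400))
print(scaled_box)
```

With Pillow, `new_img.size` is a `(width, height)` tuple, so the call would look like `new_img.crop(to_pixels(rel_box, new_img.size))`.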
The digits in the cropped picture are then extracted.
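Recognized text for an amount often still carries a currency sign, thousands separators, or stray spaces. A small clean-up step could look like the sketch below; it assumes the OCR output resembles `¥ 1,234.50`, and `parse_amount` is a hypothetical helper, not part of pyocr.

```python
import re
from decimal import Decimal


def parse_amount(ocr_text):
    """Return the first number found in `ocr_text` as a Decimal, or None."""
    # Drop thousands separators and spaces that OCR may insert between digits.
    cleaned = ocr_text.replace(",", "").replace(" ", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return Decimal(match.group()) if match else None


print(parse_amount("¥ 1,234.50"))
```

Using `Decimal` rather than `float` keeps the cents exact when amounts are added up later.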
Next, continue with the name.
Extract the name
left = 155
top = 450
right = 450
bottom = 470
image_obj2 = new_img.crop((left, top, right, bottom))
image_obj2.show()
The name here is in Chinese, so we can no longer handle it the way we extracted the amount (digits). We need CnOcr to extract the Chinese text from the picture.
image_obj2.save("tmp.jpg")
ocr = CnOcr()
res = ocr.ocr("tmp.jpg")
print("".join(res[0]))
Extract taxpayer identification number
image_text3 = new_img.crop((left, top, right, bottom))
image_text3.show()
txt3 = tool.image_to_string(image_text3)
print(txt3)
The taxpayer identification number in the picture is extracted, with the following result:
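The recognized string may contain spaces, line breaks, or punctuation that are not part of the number. The sketch below strips them and rejects results of implausible length. The 15 to 20 character range is an assumption about the ID format, not something stated in this article, and `clean_taxpayer_id` is a hypothetical helper.

```python
import re


def clean_taxpayer_id(ocr_text, min_len=15, max_len=20):
    """Keep only letters and digits; return None if the length looks wrong.

    min_len and max_len are assumed bounds; adjust them to your invoices.
    """
    candidate = re.sub(r"[^0-9A-Za-z]", "", ocr_text).upper()
    if min_len <= len(candidate) <= max_len:
        return candidate
    return None


print(clean_taxpayer_id(" 9111 0108ma01 ABCD12\n"))
```

A `None` result is a useful signal that the crop box missed the field and the invoice should be checked by hand.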
Extract the drawer
left = 528
top = 550
right = 670
bottom = 600
image_obj4 = new_img.crop((left, top, right, bottom))
image_obj4.show()
image_obj4.save("tmp.jpg")
ocr = CnOcr()
res = ocr.ocr("tmp.jpg")
print("".join(res[0]))
Since the drawer is written in Chinese characters, we use CnOcr to extract it from the picture, just as we did for the name.
OK, we have now extracted the four target fields from one invoice. Next we recognize all the invoices in the pic folder and save the results to Excel.
04. Batch identify invoices and save them in Excel
Before reading the picture, wrap the above four operations into functions that are easy to call from each invoice object.
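The wrapping step can be sketched as follows. Each field is a crop box plus a recognizer, so a small factory can build the per-field functions (the batch code calls them `text1` to `text4`). The factory name `make_extractor` and the idea of passing the recognizer in as a callable are illustrative assumptions, not the author's code.

```python
def make_extractor(box, recognize):
    """Build a function that crops `box` from an image and runs `recognize` on the crop.

    `box` is (left, top, right, bottom); `recognize` takes the cropped image
    and returns a string.
    """
    def extract(img):
        region = img.crop(box)
        return recognize(region).strip()

    return extract


class FakeImage:
    """Stand-in for a Pillow image, so the sketch runs without OCR installed."""

    def crop(self, box):
        return box


demo = make_extractor((155, 450, 450, 470), lambda region: "  %r  " % (region,))
print(demo(FakeImage()))
```

For the digit fields the recognizer would be something like `tool.image_to_string`; for the Chinese fields it would be a small function that saves the crop to tmp.jpg and runs CnOcr on it, as in the snippets above.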
Read all the pictures in the folder.
filePath = 'pic'
pic_name = []
for root, dirs, names in os.walk(filePath):
    # keeps the file names of the last directory visited; fine for a flat pic folder
    pic_name = names
for i in pic_name:
    print(i)
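A folder listing also picks up anything else that happens to be in the folder, such as tmp files or hidden system files. A hedged variant that keeps only picture files might look like this; the extension list and the helper name `list_images` are assumptions.

```python
import os

# Assumed set of picture extensions; extend it to match your files.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_images(folder):
    """Return the sorted names of picture files directly inside `folder`."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
        and os.path.isfile(os.path.join(folder, name))
    )
```

Sorting the names makes the row order in the output reproducible from run to run.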
Begin the identification and write the results to Excel.
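import openpyxl  # assumption: outwb/outws are an openpyxl workbook and sheet; the text does not show their creation
outwb = openpyxl.Workbook()
outws = outwb.active
count = 1  # assumption: results start in the first row
for i in pic_name:
    img_url = filePath + "/" + i
    with open(img_url, 'rb') as f:
        a = f.read()
    new_img = PI.open(io.BytesIO(a))
    # write one row per invoice: name, taxpayer ID, amount, drawer
    outws.cell(row=count, column=1, value=text2(new_img))
    outws.cell(row=count, column=2, value=text3(new_img))
    outws.cell(row=count, column=3, value=text1(new_img))
    outws.cell(row=count, column=4, value=text4(new_img))
    count = count + 1
outwb.save("Invoice summary – Li Yunchen.xls")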
The results are finally saved as Invoice summary – Li Yunchen.xls, and look as follows:
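If a plain CSV file is enough (Excel opens it directly), the standard library can stand in for the workbook calls. This is a swapped-in alternative, not the method used above; the header names and the helper `write_invoice_rows` are made up for illustration.

```python
import csv

# Hypothetical column names, in the same order as the Excel columns above.
HEADER = ["name", "taxpayer_id", "amount", "drawer"]


def write_invoice_rows(rows, out_file):
    """Write a header line plus one line per invoice to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(HEADER)
    writer.writerows(rows)
```

A typical call would be `with open("invoices.csv", "w", newline="", encoding="utf-8-sig") as f: write_invoice_rows(rows, f)`. The `utf-8-sig` encoding adds a byte order mark, which helps Excel display the Chinese names correctly.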
6. Summary
This article essentially achieves its goal, and the results are quite good! The complete source code can be assembled from the snippets in the text (all of it has been shared above); interested readers can try it themselves!
Be sure to try it yourself!
Finally, the example in this article can be applied in other ways, for example:
- Batch calculation of invoice totals
- Batch classification by invoice type
- …
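The first idea, totalling the invoices, needs very little on top of the extracted amounts. A minimal sketch, assuming the amounts have already been cleaned into plain numeric strings such as "1234.50" (the helper name `total_amount` is hypothetical):

```python
from decimal import Decimal


def total_amount(amounts):
    """Sum amounts given as numeric strings, keeping exact cents."""
    return sum((Decimal(a) for a in amounts), Decimal("0"))


print(total_amount(["100.10", "200.20"]))
```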
And with that, the task I promised the goddess today is done! Once it was finished, she took me home and cooked for me!
This happiness came too suddenly! If you need the source code, join QQ group 754370353 to get it. I know you are lazy, so I have put everything into one folder.
Without further ado, brothers, I’m going to eat!