1. OCR overview

1.1 Evolution of OCR technology

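  • Traditional image processing, as covered in Gonzalez's digital image processing textbook.

  • Signal processing, frequency-domain analysis, and classical algorithms such as SIFT, HOG, the Hough transform, Harris corner detection, and Canny edge detection. All of these work well.

  • The industry has basically shifted to deep learning since 2016, because it works really well.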

1.2 OCR technology business services

  • ID cards and similar card-type documents are relatively easy, but handling complex scenes is still not so easy.

  • Invoices and business documents are relatively complex. Beyond recognition, layout analysis matters even more.

  • Table recognition has been a hot topic recently and everyone is working hard on it; see, for example, Microsoft's open TableBank dataset.

  • On mobile: a MobileNet backbone, or Tesseract + OpenCV.

2. Our business scenario

2.1 Service Requirements

Meeting business needs comes first. Unlike the large vendors, whose external service APIs must handle high concurrency and cover a complete range of categories, we put more emphasis on individual products, on meeting business requirements as far as possible, and on customization.

2.2 Problems to be solved during recognition

3. OCR algorithms in detail

3.1 Algorithm overview — Sharing principle

A model, to be fully understood, requires:

  • What is the goal, purpose, meaning?

  • What is the network structure?

  • What is the loss?

  • How are the training samples built?

  • What does the post-processing do?

3.2 Overview of the algorithm — Three major sections

  • Text detection: find the text boxes and narrow them down to the minimum region, which reduces the difficulty of recognition.

  • Text recognition: once the text has been detected, a recognition algorithm reads the text inside each box, as in the middle figure.

  • Layout analysis: after recognition we have the text and its coordinates, but real business needs more than that. There has to be structure: how to arrange the recognized text into a logically structured document or piece of content. This work is also extremely complex. A team of very experienced colleagues will share their work on layout analysis with you.

3.3 Algorithm Overview — Detection algorithm

  • The table lists detection algorithms from bottom to top, sorted by effectiveness: the results get better toward the top.

  • The field has moved from anchor-based methods (the bottom image on the right) to pixel-based methods (the middle image on the right), mainly because semantic segmentation works so well.

CTPN: an algorithm for finding boxes.

The final prediction for each position is, for 10 anchors: the y-coordinate offset, the adjusted height, and the probability that the anchor is foreground. The outputs are the foreground/background probabilities, shaped [N,10,2], and the y and h adjustment values, also shaped [N,10,2]. It only works for horizontal text or for vertical text, not both at once.
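To make the [N,10,2] regression output concrete, here is a minimal sketch of how such offsets could be turned back into vertical box extents. It assumes the parameterization from the CTPN paper, v_c = (c_y - c_y^a) / h^a and v_h = log(h / h^a), where c_y^a and h^a are the center and height of the anchor. The function name and array layout are my own illustration, not our production code.

```python
import numpy as np

def decode_ctpn_anchors(anchor_cy, anchor_h, deltas):
    """Decode CTPN vertical regressions into (top, bottom) coordinates.

    anchor_cy, anchor_h: [N, 10] center y and height of each anchor.
    deltas: [N, 10, 2] network output holding (v_c, v_h) per anchor.
    Returns an [N, 10, 2] array of (top, bottom) per anchor.
    """
    v_c = deltas[..., 0]
    v_h = deltas[..., 1]
    # Invert v_c = (cy - cy_a) / h_a and v_h = log(h / h_a).
    cy = v_c * anchor_h + anchor_cy
    h = np.exp(v_h) * anchor_h
    top = cy - h / 2.0
    bottom = cy + h / 2.0
    return np.stack([top, bottom], axis=-1)
```

The horizontal extent needs no regression here because every anchor has a fixed width; the small boxes are chained into text lines afterwards.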

CTPN can be understood from the following aspects:

  • Highlights and core ideas: the box prediction and the text-line construction algorithm

  • Loss function: the foreground/background probability, plus the y and h adjustments of each anchor

  • Labels: split each large box into small boxes, then balance positive and negative samples

  • Post-processing

  • The algorithm is named EAST (Efficient and Accurate Scene Text detector) because it is an efficient and accurate scene text detection pipeline.

  • First, the image is fed to an FCN, which produces a single-channel pixel-level text score map and a multi-channel geometry map. The text region can take one of two geometric forms, RBOX or QUAD, and a different loss function is designed for each. A threshold is then applied to every predicted region: geometries whose score exceeds the predetermined threshold are considered valid and kept for the subsequent non-maximum suppression. The result after NMS is the final output of the pipeline.

  • Final prediction: score map, text box geometry, and text rotation angle

  • The labels are: a mask, four distance maps (up, down, left, and right), and an angle map: three kinds altogether.

  • Each of them has a corresponding loss. Every predicted pixel, together with the angle, yields one box. That is too many boxes, so they are merged with LANMS. Why not use the score map directly? I think it is not reliable enough on its own, so the bounding box is added to strengthen the verification.
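The "one box per pixel" idea can be sketched as follows. This is only an illustration under my own assumptions: the geometry channels are ordered (top, right, bottom, left, angle), the angle is in radians with the sign convention shown in the code, and 0.8 is a placeholder threshold. The LANMS merge that follows in the real pipeline is left out.

```python
import numpy as np

def rbox_to_quad(x, y, top, right, bottom, left, angle):
    """Return the 4 corners, shape (4, 2), of the rotated box predicted at pixel (x, y)."""
    # Corners in the box's own frame, with the pixel at the origin.
    local = np.array(
        [[-left, -top], [right, -top], [right, bottom], [-left, bottom]],
        dtype=float,
    )
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    # Rotate every corner, then move the box back to the pixel position.
    return local @ rot.T + np.array([x, y], dtype=float)

def decode_east(score_map, geo_map, threshold=0.8):
    """score_map: [H, W]; geo_map: [H, W, 5] = (top, right, bottom, left, angle).

    Returns a list of (quad, score), one entry per pixel above the threshold.
    """
    ys, xs = np.where(score_map > threshold)
    return [
        (rbox_to_quad(x, y, *geo_map[y, x]), float(score_map[y, x]))
        for y, x in zip(ys, xs)
    ]
```

Even a small text line produces hundreds of nearly identical boxes this way, which is why a merging step is needed before standard NMS.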

PSENet is an instance segmentation network with two advantages. First, as a segmentation-based method, PSENet can locate text of any shape. Second, the model proposes a progressive scale expansion algorithm, which can successfully separate adjacent text instances.
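A minimal sketch of progressive scale expansion, assuming the network has already produced a list of binary kernel masks ordered from smallest to largest. It labels the connected components of the smallest kernel, then grows those labels into each larger kernel with a breadth-first search, so a pixel belongs to whichever instance reaches it first. The function names are illustrative and this is pure Python, so it is slow on real images.

```python
from collections import deque
import numpy as np

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def _grow(labels, mask, queue):
    """Breadth-first expansion of existing labels into unlabeled pixels of mask."""
    h, w = labels.shape
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]
                queue.append((ny, nx))

def progressive_scale_expansion(kernels):
    """kernels: list of binary [H, W] masks, smallest kernel first.

    Returns an int label map where 0 is background.
    """
    smallest = kernels[0]
    h, w = smallest.shape
    labels = np.zeros((h, w), dtype=np.int32)
    # Step 1: connected components of the smallest kernel give the instances.
    next_label = 0
    for y in range(h):
        for x in range(w):
            if smallest[y, x] and labels[y, x] == 0:
                next_label += 1
                labels[y, x] = next_label
                _grow(labels, smallest, deque([(y, x)]))
    # Step 2: expand every instance into each larger kernel in turn.
    for kernel in kernels[1:]:
        _grow(labels, kernel, deque(zip(*np.nonzero(labels))))
    return labels
```

Two text lines that touch in the largest mask still end up with different labels, because each one started from its own small kernel.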

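  • On the left is an FPN with a ResNet-50 backbone. Why ResNet-50? It performs well and has a moderate number of parameters.

  • The paper uses 6 scales; would one not be enough? My understanding is that the smallest kernels are completely separated from each other, and expanding gradually, scale by scale, prevents neighboring instances from crossing into each other.

  • A detail on feature fusion: U-Net fuses by concatenation and FCN by element-wise addition. The original FPN merges its lateral connections by addition, while PSENet concatenates the upsampled FPN outputs.

  • DB (Differentiable Binarization): with the DB module, the binarization operation becomes differentiable and can be included in the network for training.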

Network output

  • Probability map: the probability that each pixel is text

  • Threshold map: the threshold for each pixel

  • Binary map: computed from the two maps above using the DB formula
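For reference, the DB formula from the paper is B = 1 / (1 + exp(-k (P - T))), where P is the probability map, T is the threshold map, and k is an amplification factor (the paper uses k = 50). A minimal NumPy sketch, with a function name of my own choosing:

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map: a steep sigmoid of (probability - threshold)."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```

Because this is a smooth function of both inputs, gradients flow into the probability map and the threshold map during training, which a hard step function would not allow.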

Label production

  • Probability map: built as in PSENet, with the shrink ratio set to 0.4

  • Threshold map: shrink the text box inward and expand it outward by D pixels (D is computed in the shrinking step above), then, for each pixel in the region between the shrunk box and the expanded box, compute the normalized distance to the boundary of the original box.
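The offset D follows the formula used in the PSENet and DB papers, D = A (1 - r^2) / L, where A is the polygon area, L its perimeter, and r the shrink ratio. A small sketch with hypothetical helper names; the actual polygon offsetting (usually done with a clipping library) is not shown:

```python
import math

def polygon_area_and_perimeter(points):
    """points: list of (x, y) vertices in order. Uses the shoelace formula."""
    area2 = 0.0
    perimeter = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2.0, perimeter

def shrink_offset(points, ratio=0.4):
    """Distance D by which the box is shrunk inward (and expanded outward)."""
    area, perimeter = polygon_area_and_perimeter(points)
    return area * (1.0 - ratio ** 2) / perimeter
```

For a 100×20 text box and r = 0.4 this gives an offset of about 7 pixels.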

3.4 Algorithm overview — Identification algorithm

  • Attention: "Attention-based Extraction of Structured Information from Street View Imagery", a first attempt in 2017

CRNN is a very classic algorithm whose core is CTC. Connectionist Temporal Classification (CTC) suits situations where the alignment between input and output is unknown, which makes it a good fit for speech recognition and handwriting recognition tasks.
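To show how CTC gets around the missing alignment at inference time, here is a minimal greedy decoder: take the best class per time step, merge consecutive repeats, then drop the blank symbol. The alphabet layout (blank at index 0) and the function name are assumptions for this sketch.

```python
import numpy as np

def ctc_greedy_decode(probs, alphabet, blank=0):
    """probs: [T, C] per-time-step class scores; alphabet[i] is the character of class i."""
    best_path = np.argmax(probs, axis=1)
    chars = []
    prev = blank
    for idx in best_path:
        # Emit a character only when it is not blank and not a repeat of the previous step.
        if idx != blank and idx != prev:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)
```

A blank between two identical characters is what lets the decoder output a genuine double letter: the path a, a, blank, a, b, b, blank decodes to "aab".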

Drawback: attention-based models cannot always accurately associate the feature vector with the corresponding target region in the input image. This phenomenon is called attention drift.


  • What do we know? What character? What number? This information!

  • Which character? Find that character. What is it? And then compare that to the order in the sample

  • Which character is it? And the corresponding position of the character ratio

  • So you can’t have duplicate characters in the sample.

4. Our practice

4.1 The Road to Practice

  • Non-document images: filtered out by aspect ratio, white-pixel ratio, etc.

  • Rotation angle: determined by the rotation model together with the projection distribution

  • Multiple documents: when several documents appear in one image, they are detected through projection, by checking whether the value exceeds a configured threshold

  • Table recognition: Mask R-CNN is used to find the edges of large tables

  • Post-processing: error correction via NLP, which will be discussed in detail later

4.2 The Road to Practice — Rotating model

Coarse direction classification

First version:

  • VGG as the backbone, followed by fully connected layers and a four-class output

  • Samples: manual annotation plus augmentation

  • Accuracy: 90%

Second version:

  • Crop the image into 256×256 patches

  • Use MSER to find candidate regions

  • Train on the small patches

  • Take the mode of the patch predictions as the most likely direction

  • Accuracy: 99.7%
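The voting step of the second version can be sketched in a few lines. I assume here that the four classes correspond to 0°, 90°, 180°, and 270° and that each patch has already been classified; the function name is illustrative.

```python
from collections import Counter

def vote_direction(patch_predictions):
    """patch_predictions: list of per-patch class labels, e.g. 0, 90, 180, 270.

    Returns the most frequent label (the mode).
    """
    if not patch_predictions:
        raise ValueError("no patch predictions to vote on")
    return Counter(patch_predictions).most_common(1)[0][0]
```

A few misclassified patches no longer flip the result for the whole page, which is one plausible reason the patch-based version is so much more accurate than classifying the whole image once.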


Fine angle adjustment:

  • Rotate the image in 1° steps and compute the vertical projection at each step

  • The angle with the largest projection variance is the fine-tuning angle
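A minimal sketch of the projection-variance search. Instead of rotating the whole image, it rotates the coordinates of the foreground pixels and histograms their y values, which gives the same projection profile. The ±10° search range, the sign convention of the angle, and the function names are assumptions of this sketch.

```python
import numpy as np

def projection_variance(ys, xs, angle_deg, bins):
    """Variance of the projection profile after rotating the points by angle_deg."""
    theta = np.deg2rad(angle_deg)
    # y coordinate of every foreground pixel after the rotation.
    rotated_y = xs * np.sin(theta) + ys * np.cos(theta)
    hist, _ = np.histogram(rotated_y, bins=bins)
    return hist.var()

def estimate_skew(binary, max_angle=10, step=1):
    """binary: [H, W] array, nonzero where there is text. Returns the best angle in degrees."""
    ys, xs = np.nonzero(binary)
    # Center the coordinates so the rotation is about the middle of the text.
    ys = ys - ys.mean()
    xs = xs - xs.mean()
    radius = int(np.ceil(np.hypot(ys, xs).max()))
    bins = np.arange(-radius - 1, radius + 2)  # fixed 1-pixel bins for every angle
    angles = np.arange(-max_angle, max_angle + step, step)
    scores = [projection_variance(ys, xs, a, bins) for a in angles]
    return int(angles[int(np.argmax(scores))])
```

When the text lines are level, the pixels pile up in a few rows and leave the gaps between lines empty, so the profile is spiky and its variance is high; at any other angle the profile smears out.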

4.3 Pitfalls we encountered

  • We replaced the custom CNN from the CRNN paper with ResNet, but ResNet downsamples by a factor of 32, so the input has to be stretched to a width of 512.

  • First, the sample set: 10 million samples (500,000 images, single-character confidence 95%+), made up of 1 million real samples + 1 million synthetic common-word samples + 2 million synthetic digit, time, and English samples + 6 million synthetic samples of other characters. This took about 3-4 days.

  • Next, training: ResNet-50, 5-6 days; resize: 1024 => 512×8, 256×8

The greedy decoding step needed improvement during the process:

=> beam_search / merge_repeated=True. When the confidence is high, the difference in results is small, but the speed improves greatly: 28 seconds => 10 seconds, with batch=128 and input size 512x32.

  • Because we have the CRNN probabilities, error correction can be targeted: we replace only the doubtful characters with candidate characters.

  • One detail about the probabilities: when identical characters are adjacent in the raw output, such as "me me __", take the largest probability.

  • Doubtful characters are replaced based on stroke similarity: the candidate should be the one closest in strokes to the originally recognized character, and edit distance is used as well.
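The two mechanical parts of this, merging adjacent duplicates with the largest probability and picking a replacement by edit distance, can be sketched as below. The stroke-similarity scoring is not shown. The 0.9 confidence threshold, the "_" blank symbol, the vocabulary, and all function names are assumptions made for illustration.

```python
def collapse_with_max_prob(chars, probs, blank="_"):
    """Merge runs of the same character, keeping the highest probability of each run."""
    merged = []
    prev = None
    for ch, p in zip(chars, probs):
        if ch == prev:
            if ch != blank:
                merged[-1] = (ch, max(merged[-1][1], p))
        elif ch != blank:
            merged.append((ch, p))
        prev = ch
    return merged

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1, row[j - 1] + 1, prev_row[j - 1] + (ca != cb)))
        prev_row = row
    return prev_row[-1]

def correct_word(word, char_probs, vocabulary, min_prob=0.9):
    """Replace the word with the closest vocabulary entry if any character is doubtful."""
    if all(p >= min_prob for p in char_probs):
        return word
    return min(vocabulary, key=lambda cand: edit_distance(word, cand))
```

For example, `correct_word("invoce", [0.99, 0.99, 0.99, 0.5, 0.99, 0.99], ["invoice", "involve", "voice"])` picks "invoice", while a word whose characters are all confident is returned unchanged.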

4.4 Our experience

1. Development experience

2. Production experience

TensorFlow container:

  • The model is deployed with the officially recommended TensorFlow Serving, in container mode

  • Server-side batching is not enabled; we control batching ourselves

  • The host only needs the graphics card driver. CUDA and cuDNN live inside the container, which avoids version-compatibility problems

Service container:

  • Define your own Web container base image

  • Automatic container building, dynamic orchestration

Author: Liu Chuang, CreditEase Institute of Technology