Background

Computer vision uses cameras and computers in place of human eyes, giving machines the ability to detect, recognize, understand, track, and discriminate targets much as people do. Taking Meituan's business as an example, computer vision is applied in many links of the pipeline, including text recognition, image classification, object detection, and image quality assessment, across merchant listing, deal display, and consumer reviews. This article introduces the application of deep learning in computer vision through the OCR (Optical Character Recognition) scenario.

OCR based on deep learning

Text is an indispensable source of visual information. Compared with other content in images and videos, text often carries stronger semantic information, so extracting and recognizing text in images is of great significance. OCR plays two main roles in Meituan's business. The first is input assistance: in mobile payment, photographing a bank card to recognize the card number enables automatic card binding; menu information entry is assisted for operations staff; and recognizing merchant receipts in the delivery pipeline supports dispatch and verification, as shown in Figure 1. The second is verification: during merchant qualification review, information is extracted from and checked against the ID card, business license, and catering license photos uploaded by merchants to ensure their legitimacy, and the machine filters out images containing prohibited words generated during merchant listing and user reviews.

OCR technology development history

Traditional OCR, based on image processing (binarization, connected component analysis, projection analysis, etc.) and statistical machine learning (AdaBoost, SVM), has achieved good results on printed and scanned documents over the past 20 years. The overall flow of a traditional printed-text OCR solution is shown in Figure 2.

From input image to recognition result, there are three stages: image preprocessing, text line extraction, and text line recognition. The steps of text line extraction (layout analysis, line segmentation) involve a large number of prior rules, while text line recognition is mainly based on traditional machine learning methods. With the popularity of mobile devices, extracting and recognizing text in photographs has become a mainstream demand, and scene text recognition is becoming more and more prominent. Compared with printed documents, recognizing photographed text faces the following three challenges:

  • Complex imaging: noise, blur, lighting changes, deformation.
  • Complex text: arbitrary fonts, sizes, colors, wear, stroke widths, and orientations.
  • Complex scenes: incomplete layouts and background interference.

Traditional OCR solutions to these challenges have the following shortcomings:

  • Generating text lines through layout analysis (connected component analysis) and line segmentation (projection analysis) requires a highly regular layout and strong foreground/background separability (e.g., black-and-white document images, license plates); it cannot handle arbitrary text with complex foregrounds and backgrounds (e.g., scene text, menus, advertising text). In addition, the binarization step itself imposes strict requirements on imaging conditions and background.
  • Character recognition models trained on hand-designed edge direction features (such as the histogram of oriented gradients) rely on a single kind of feature whose generalization ability degrades rapidly under font changes, blur, or background interference.
  • Over-reliance on the result of character segmentation: when characters are distorted, touching, or corrupted by noise, the propagation of segmentation errors is especially prominent.
  • Although the image preprocessing module can effectively improve the quality of the input image, cascading several independent correction modules inevitably propagates errors. Moreover, because each module's optimization objective is independent, they cannot be integrated into a unified framework.

To address these problems, existing techniques have improved on the following three aspects.

1. Text line extraction

Traditional OCR (as shown in Figure 3) adopts a top-down splitting approach, which is only suitable for regular layouts with simple backgrounds. Two other lines of thinking exist in the field.

  • Bottom-up generative methods. These extract candidate regions through connected component analysis, maximally stable extremal regions (MSER), or similar methods, filter the regions with a text/non-text classifier, merge the surviving regions into text lines, and then perform line-level filtering, as shown in Figure 3. The disadvantages of this approach are that the long pipeline introduces too many hyperparameters, and that global information cannot be exploited.

  • Sliding-window-based methods. These apply the idea of generic object detection to text line extraction, searching the whole image with a trained text-line/word/character-level classifier. The original sliding-window methods train a text/background binary classifier and scan the input image with multi-scale windows. The detector can be a traditional machine learning model (AdaBoost, Random Ferns) or a deep convolutional neural network.

To improve efficiency, methods such as DeepText and TextBoxes first extract candidate regions and then perform regional regression and classification. Such methods can be trained end to end, but their recall on text regions with arbitrary angles or extreme aspect ratios is low.

2. Traditional single-character recognition engine → single-character recognition engine based on deep learning

Since training a single-character recognition engine is a typical image classification problem, and convolutional neural networks have a clear advantage in describing the high-level semantics of images, the mainstream approach is an image classification model based on a convolutional neural network. The key points in practice are how to design the network structure and how to synthesize training data. For the network structure, one can refer to architectures from the handwriting recognition field, or adopt the Maxout network structure that has achieved excellent results in OCR, as shown in Figure 4. For data synthesis, factors such as font, deformation, blur, noise, and background variation should be considered.

Table 1 compares features learned by a convolutional neural network with traditional features. It can be seen that the features learned by the convolutional neural network have stronger discriminative ability.

3. Text line recognition process

Traditional OCR splits text line recognition into two independent steps: character segmentation and character recognition. Although a character recognition engine trained with a convolutional neural network can effectively raise the character recognition rate, segmentation tolerates touching, blurred, or deformed characters poorly, and segmentation errors are irreparable for recognition. Under this framework, the accuracy of text line recognition is therefore mainly limited by character segmentation. Suppose the trained single-character recognition engine has accuracy P = 99% and the character segmentation accuracy is Q = 95%; the expected accuracy of recognizing a text line of length L is then P_line = (P · Q)^L, which for L = 10 gives P_line ≈ 54.1%.
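
A quick numeric check of this error-propagation arithmetic:

```python
# Each of the L characters must be both segmented and recognized correctly.
P, Q, L = 0.99, 0.95, 10       # single-char accuracy, segmentation accuracy, line length
line_accuracy = (P * Q) ** L   # probability the whole line is correct
print(f"{line_accuracy:.1%}")  # -> 54.1%
```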

Since the room to improve character segmentation in isolation is limited, related work jointly optimizes the segmentation and recognition tasks. Existing techniques fall into two types: segmentation-based and segmentation-free methods.

  • Segmentation-based methods

This approach still retains an explicit segmentation step, but introduces a dynamic merging mechanism that uses recognition information such as confidence to guide segmentation, as shown in Figure 5.

The over-segmentation module splits the text line into fragments perpendicular to the baseline such that each fragment contains at most one character; in general, it splits a character into several consecutive strokes. Over-segmentation can be rule-based or learned. Rule-based methods determine candidate cut points directly from the binarized image via connected component analysis and projection analysis, and the granularity can be controlled by tuning parameters so that characters are split as finely as possible. Rule-based methods are simple to implement but perform poorly under complex imaging or background conditions. Learning-based methods train a binary cut-point classifier offline and then run sliding-window detection over the text line image with that classifier.
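
As an illustration of the rule-based variant, the sketch below proposes candidate cut points by vertical projection analysis on a binarized text line; the ink threshold and minimum-gap parameters are illustrative assumptions:

```python
import numpy as np

def candidate_cuts(binary_line, ink_thresh=0, min_gap=2):
    """Rule-based over-segmentation sketch: propose cut points at low-ink
    columns of a binarized text line (text pixels = 1, background = 0)."""
    projection = binary_line.sum(axis=0)              # ink pixels per column
    valley_cols = np.flatnonzero(projection <= ink_thresh)
    cuts = []
    for c in valley_cols:                             # keep one cut per valley run
        if not cuts or c - cuts[-1] > min_gap:
            cuts.append(int(c))
    return cuts
```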

The dynamic merging module combines adjacent strokes into candidate character regions according to the recognition results; the best combination corresponds to the optimal segmentation path and recognition result. Intuitively, searching for the best combination can be cast as a path search problem, for which there are two strategies: depth-first and breadth-first. Depth-first search extends only the current best state at each step, so it is globally suboptimal and unsuitable for long text lines. Breadth-first strategies such as Viterbi decoding and beam search, widely used in speech recognition, extend all current states simultaneously at each step. For performance, beam search usually introduces pruning to control path growth, including limiting the number of states that can be extended (e.g., only the top-N states per step) and adding state constraints (e.g., on the shape of merged characters).
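
The following is a minimal sketch of beam search over over-segmentation fragments, under stated assumptions: `recognize` is a hypothetical single-character recognizer returning a character and its log-probability, and length normalization and character-shape constraints are omitted for brevity:

```python
def beam_search(fragments, recognize, beam_width=5, max_merge=4):
    """Breadth-first path search over over-segmentation fragments.

    recognize(list_of_fragments) -> (char, log_prob) is a hypothetical
    single-character recognition engine.
    """
    beams = [(0.0, 0, "")]  # (cumulative log prob, next fragment index, decoded text)
    while not all(i == len(fragments) for _, i, _ in beams):
        candidates = []
        for score, i, text in beams:
            if i == len(fragments):            # finished path: carry forward
                candidates.append((score, i, text))
                continue
            for k in range(1, min(max_merge, len(fragments) - i) + 1):
                char, logp = recognize(fragments[i:i + k])   # merge k fragments
                candidates.append((score + logp, i + k, text + char))
        candidates.sort(key=lambda c: -c[0])   # prune: keep only the top-N states
        beams = candidates[:beam_width]
    return beams[0][2], beams[0][0]            # best text and its log prob
```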

Because dynamic merging produces multiple candidate paths, an appropriate evaluation function is needed for path selection. Its design mainly considers two aspects: path structure loss and path recognition score. The structure loss measures how plausible a segmentation path is in terms of character shape characteristics; the recognition score corresponds to the average single-character recognition confidence and the language model score along a given path.

This scheme attempts to combine character segmentation and single-character recognition in one framework, but because over-segmentation remains an independent step, it does not achieve true end-to-end learning.

  • Segmentation-free methods

These methods bypass character segmentation entirely and recognize text lines directly, through either sliding windows or sequence modeling.

Sliding-window recognition applies the idea of sliding-window detection: using a single-character recognition engine trained offline, the text line image is scanned from left to right at multiple scales, recognizing the content centered in each window. A greedy strategy or non-maximum suppression (NMS) can then be used to obtain the final recognition path; Figure 6 illustrates the process. Sliding-window recognition has two problems: if the sliding step is too small, the computational cost is high, while if it is too large, context information is easily lost; and whichever path decision scheme is adopted, it depends heavily on single-character recognition confidence.
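
For concreteness, here is a minimal sketch of the NMS-based path decision, assuming each window is described only by its horizontal extent and a recognition confidence; the IoU threshold is an illustrative choice:

```python
def nms_1d(windows, scores, iou_thresh=0.3):
    """Greedy 1-D non-maximum suppression over sliding-window responses.

    windows: list of (left, right) extents along the text line;
    scores: single-character recognition confidences per window.
    """
    order = sorted(range(len(windows)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        li, ri = windows[i]
        suppressed = False
        for j in keep:
            lj, rj = windows[j]
            inter = max(0.0, min(ri, rj) - max(li, lj))
            union = (ri - li) + (rj - lj) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return sorted(keep, key=lambda i: windows[i][0])  # left-to-right reading order
```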

Sequence learning originated in handwriting recognition and speech recognition, whose common feature is the need to model time-series data. Although a text line image is two-dimensional, if the left-to-right scan is treated as a sequence, text line recognition essentially reduces to the same problem. Improving sequence learning through end-to-end training that eliminates intermediate steps such as rectification, segmentation, and single-character recognition has become a hot research topic.

Based on existing techniques and the OCR scenarios involved in Meituan's business, we adopt the deep learning frameworks shown in Figure 7 for text detection and text line recognition.

The specific schemes for text detection and text line recognition are introduced below in turn.

Text detection based on deep learning

For Meituan's OCR scenarios, depending on whether the layout provides prior information (the rectangular region of a card, the key fields of a certificate) and on the complexity of the text itself (e.g., horizontal vs. multi-angle text), images can be divided into controlled scenarios (such as ID cards, business licenses, and bank cards) and uncontrolled scenarios (such as menus and storefront signs), as shown in Figure 8.

Given the different characteristics of these two scenarios, we draw on different detection frameworks. Because controlled scene text has many constraints, the problem can be simplified, so the Faster R-CNN framework widely used in generic object detection is adopted for detection. For uncontrolled scene text, deformation and inconsistent stroke widths mean the target contour lacks a well-closed boundary, so we use image semantic segmentation to label the text and background regions.

1. Text detection of controlled scenes

For controlled scenarios (such as ID cards), we transform text detection into the detection of keyword targets (such as name, ID number, address) or key items (such as the bank card number). The keyword detection process based on Faster R-CNN is shown in Figure 9. To ensure the localization accuracy of the regression boxes and to improve runtime speed, we fine-tuned the original framework and training method:

  • Considering the limited intra-class variation of keywords and key items, only three convolutional layers are used in the network structure.
  • The positive-sample overlap threshold is raised during training.
  • The aspect ratios of the RPN-layer anchors are adapted to the aspect-ratio range of the keywords or key items (see the sketch below).
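
As an illustration of the third point, the following sketch generates anchor shapes from configurable aspect ratios; the specific ratios and scales are assumptions, chosen to favor the short-and-wide boxes typical of keyword fields rather than generic object-detection defaults:

```python
import numpy as np

def keyword_anchors(base_size=16, ratios=(0.2, 0.35, 0.5), scales=(4, 8, 16)):
    """Generate anchor (width, height) pairs for one feature-map position.

    ratios are h/w values skewed toward wide boxes, matching keyword fields
    better than the generic (0.5, 1.0, 2.0) defaults; values are illustrative.
    """
    anchors = []
    for r in ratios:
        for s in scales:
            area = float(base_size * s) ** 2
            w = np.sqrt(area / r)        # from area = w * h and r = h / w
            h = w * r
            anchors.append((w, h))
    return anchors
```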

The Faster R-CNN framework consists of two sub-networks: the RPN (region proposal network) and the RCN (region classification network). The RPN extracts candidate regions through supervised learning, producing category-agnostic regions with coarse localization; the RCN introduces category information, classifying the candidate regions and producing finely localized results. During training, the two sub-networks are optimized jointly, end to end. Figure 10 shows the outputs of the RPN and RCN layers, taking bank card number recognition as an example.

For scenes in which a person holds an identification document, the target fields occupy too small a proportion of the image, so extracting small candidate targets directly causes some loss of localization accuracy. To ensure both high recall and high localization accuracy, a coarse-to-fine detection strategy can be adopted: first locate the region where the card lies, then detect keywords within that region. The card region can likewise be located with the Faster R-CNN framework, as shown in Figure 11.

2. Text detection of uncontrolled scenes

For uncontrolled scenes such as menus and storefront signs, the multi-angle orientation of the text lines and the large variation in stroke width make text line localization very challenging. Generic object detection works at the granularity of regression boxes and suits rigid objects with well-closed boundaries, whereas text usually consists of a series of loose strokes; for text with arbitrary orientation or stroke width in particular, localization by regression box alone deviates substantially. In addition, rigid objects are relatively tolerant of localization error: even if only part of an object is located (say, 50% overlap between the detection and the ground truth), recognition is not significantly affected, while such localization error is likely fatal for text recognition.

To achieve sufficiently fine localization, we use a fully convolutional network (FCN), common in semantic segmentation, to label text and background at the pixel level. The overall process is shown in Figure 12.

The multi-scale fully convolutional network fuses multi-stage deconvolution results to combine global and local features, achieving coarse-to-fine pixel-level labeling, and is suitable for arbitrary uncontrolled scenes (storefront images and menu photos).
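
A minimal PyTorch sketch of this idea is shown below, assuming a toy two-stage encoder; the layer sizes are illustrative, not the network actually used. Coarse logits from the deeper stage are upsampled by deconvolution and fused with logits from the shallower stage before final upsampling to the input resolution:

```python
import torch
import torch.nn as nn

class MultiScaleFCN(nn.Module):
    """Toy two-stage FCN with deconvolution fusion for text/background labeling."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))          # 1/2 resolution, local
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))          # 1/4 resolution, global
        self.score1 = nn.Conv2d(32, 2, 1)                     # text/bg logits per stage
        self.score2 = nn.Conv2d(64, 2, 1)
        self.up2 = nn.ConvTranspose2d(2, 2, 4, stride=2, padding=1)  # 1/4 -> 1/2
        self.up1 = nn.ConvTranspose2d(2, 2, 4, stride=2, padding=1)  # 1/2 -> full

    def forward(self, x):                    # x: (N, 3, H, W), H and W even
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        fused = self.score1(f1) + self.up2(self.score2(f2))  # local + global cues
        return self.up1(fused)               # (N, 2, H, W) pixel-level logits
```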

From the pixel-level labels produced by the multi-scale fully convolutional network, connected component analysis yields a set of connected regions (stroke information). However, since it cannot be determined which connected components belong to the same text line, single-linkage clustering is used to extract the text lines. For the distance metric involved in clustering, features are mainly extracted from the distances between connected components, their shapes, and their color similarity, and the feature weights and threshold are learned adaptively through metric learning, as shown in Figure 13.
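
A minimal sketch of the clustering step using SciPy's hierarchical clustering is given below; the feature construction is a placeholder, assumed to be pre-scaled by the learned metric weights, and `dist_thresh` plays the role of the adaptively learned threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_into_lines(features, dist_thresh):
    """Single-linkage clustering of connected components into text lines.

    features: (n, d) array with one row per connected component (e.g. centroid,
    height, mean color), assumed already weighted by the learned metric.
    Returns a text-line id for each component.
    """
    Z = linkage(features, method="single")   # single-linkage dendrogram
    return fcluster(Z, t=dist_thresh, criterion="distance")
```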

Figure 14 shows the localization results of the fully convolutional network in the menu and storefront scenarios. The second column is the pixel-level labeling produced by the network, and the third column is the final text detection result. It can be seen that the fully convolutional network handles complex layouts and multi-angle text localization well.

Text recognition based on sequence learning

We reduce whole-line recognition to a sequence learning problem. A recurrent neural network based on bidirectional long short-term memory (BLSTM) serves as the sequence learner to model the internal relations of the sequence effectively. To provide more effective input features, a convolutional neural network extracts features describing the high-level semantics of the image. In the design of the loss function, since the output sequence cannot be aligned frame by frame with the input feature sequence, we directly use a structured loss (sequence-to-sequence loss) and introduce a background (blank) category to absorb the ambiguity between adjacent characters.

The overall network is divided into three layers, as shown in Figure 15: the convolutional layer extracts image features; the recurrent layer models the sequential relations within the feature sequence and between characters; and the transcription layer decodes the per-timestep classification results.

For an input image of fixed height H0 = 36 (and arbitrary width, e.g., W0 = 248), the CNN extracts a 9×62×128 feature map, which can be viewed as a sequence of length 62 and fed to the RNN layer. The RNN layer has 400 hidden nodes; the input at each timestep is a 9×128-dimensional feature describing a local region of the image. Since the image region corresponding to the feature at any given time is strongly correlated with the content before and after it, we adopt a bidirectional RNN, as shown in Figure 16.

The bidirectional RNN is followed by a fully connected layer whose input is the RNN output at each timestep and whose output is the probability that the position is background or each character in the character table. The fully connected layer is followed by CTC (connectionist temporal classification) as the loss function. During training, the probability P(ground truth) of the ground-truth string appearing in the image is computed from the per-timestep text/background probability distributions, and -log(P(ground truth)) is taken as the loss. During testing, CTC acts as a decoder: the per-timestep predictions (the character with maximum posterior probability at each moment) are combined, and then blanks and repeats are removed to form the final sequence prediction, as shown in Figure 17.
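
A minimal PyTorch sketch of this convolution + bidirectional LSTM + CTC pipeline follows; the layer sizes are chosen to roughly reproduce the dimensions quoted above (height-36 input, a 9×62×128 feature map for W0 = 248, 400 hidden units), but the exact architecture is an assumption:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Convolution + bidirectional LSTM + per-timestep classifier sketch."""
    def __init__(self, num_classes):          # num_classes includes the CTC blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 36 -> 18
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 18 -> 9
        )
        self.rnn = nn.LSTM(input_size=9 * 128, hidden_size=400,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 400, num_classes)

    def forward(self, x):                     # x: (N, 1, 36, W)
        f = self.cnn(x)                       # (N, 128, 9, W/4): 9x62x128 for W = 248
        f = f.permute(0, 3, 1, 2).flatten(2)  # (N, T, 9*128), one feature per timestep
        h, _ = self.rnn(f)                    # (N, T, 800) bidirectional context
        return self.fc(h)                     # (N, T, num_classes) per-timestep logits

# Training sketch: CTC loss over the per-timestep distributions.
# model = CRNN(num_classes=len(charset) + 1)
# logits = model(images).log_softmax(-1).permute(1, 0, 2)   # (T, N, C) for CTCLoss
# loss = nn.CTCLoss(blank=0)(logits, targets, input_lengths, target_lengths)
```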

As can also be seen from Figure 17, the LSTM output layer produces a distinct spike for each character in the input sequence, although a spike does not necessarily align with the center of the character. In other words, once the CTC mechanism is introduced, we no longer need to consider the exact position of each character, only the text content of the whole image sequence, finally realizing end-to-end deep learning training and prediction.
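
For reference, best-path (greedy) CTC decoding reduces to a small function: take the argmax at each timestep, collapse repeats, and drop blanks. The `charset` mapping here is a hypothetical lookup from class indices to characters:

```python
def ctc_greedy_decode(logits, blank=0, charset=None):
    """Best-path CTC decoding: per-timestep argmax, collapse repeats, drop blanks.

    logits: iterable of per-timestep class-score vectors.
    """
    path = [max(range(len(step)), key=step.__getitem__) for step in logits]
    decoded, prev = [], blank
    for p in path:
        if p != blank and p != prev:          # new non-blank label
            decoded.append(p)
        prev = p
    return [charset[i] for i in decoded] if charset else decoded
```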

Since the sequence learning framework is demanding about the quantity and distribution of training samples, we use real samples plus synthetic samples. Real samples come from Meituan's business sources (e.g., menus, ID cards, business licenses), while synthetic samples account for font, deformation, blur, noise, background, and other factors.

Figure 18 shows text line recognition results of this sequence learning framework in different scenarios: the first two rows are CAPTCHAs, the third a bank card, the fourth a qualification certificate, the fifth a storefront sign, and the sixth a menu. It can be seen that the recognition model is robust to text deformation, touching characters, image blur, lighting changes, and complex backgrounds.

Based on the above experiments, our text recognition performance improves greatly over traditional OCR across various scenarios, as shown in Figure 19.

Compared with traditional OCR, deep-learning-based OCR raises the recognition rate significantly. However, for specific application scenarios (business licenses, menus, bank cards, etc.), field-level accuracy still needs improvement. On the one hand, deep-learning-based text detection should be combined with traditional layout analysis to further improve detection performance in constrained scenarios; on the other hand, real training samples and language models need to be enriched to improve character recognition accuracy.

References

[1] H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, and B. Girod. "Robust Text Detection in Natural Images with Edge-Enhanced Maximally Stable Extremal Regions." ICIP 2011.

[2] Z. Zhong, L. Jin, S. Zhang, and Z. Feng. "DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images." arXiv preprint, 2015.

[3] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. "TextBoxes: A Fast Text Detector with a Single Deep Neural Network." AAAI 2017.

[4] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NIPS 2015.

[5] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006.

[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." CVPR 2014.

[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." CVPR 2016.

[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. "SSD: Single Shot MultiBox Detector." ECCV 2016.

[9] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. "Object Detection with Discriminatively Trained Part-Based Models." TPAMI 2010.

[10] P. Viola and M. Jones. "Robust Real-Time Object Detection." IJCV 2004.

[11] N. Markus, M. Frljak, I. S. Pandzic, J. Ahlberg, and R. Forchheimer. "Object Detection with Pixel Intensity Comparisons Organized in Decision Trees." CoRR 2014.

[12] S. Liao, A. K. Jain, and S. Z. Li. "A Fast and Accurate Unconstrained Face Detector." TPAMI 2015.

[13] D. Chen, S. Ren, and J. Sun. "Joint Cascade Face Detection and Alignment." ECCV 2014.

[14] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. "A Convolutional Neural Network Cascade for Face Detection." CVPR 2015.

[15] L. Huang, Y. Yang, Y. Deng, and Y. Yu. "DenseBox: Unifying Landmark Localization with End to End Object Detection." CVPR 2015.

[16] Y. Taigman, M. Yang, M. A. Ranzato, et al. "DeepFace: Closing the Gap to Human-Level Performance in Face Verification." CVPR 2014.

[17] Y. Sun, X. Wang, and X. Tang. "Deep Learning Face Representation from Predicting 10,000 Classes." CVPR 2014.

[18] Y. Sun, Y. Chen, X. Wang, et al. "Deep Learning Face Representation by Joint Identification-Verification." NIPS 2014.

[19] "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

[20] "A Discriminative Feature Learning Approach for Deep Face Recognition." ECCV 2016.

[21] "Rethinking the Inception Architecture for Computer Vision." CVPR 2016.

[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.

[23] N. Murray, L. Marchesotti, and F. Perronnin. "AVA: A Large-Scale Database for Aesthetic Visual Analysis." CVPR 2012.

Team introduction

The Meituan-Dianping algorithm team is the "brain" of the Meituan-Dianping technology team, covering search, recommendation, advertising, intelligent scheduling, natural language processing, computer vision, robotics, autonomous driving, and other technical fields. It has helped hundreds of millions of active Meituan-Dianping users improve their experience, and millions of merchants in more than 200 categories, including catering, hotels, weddings, beauty, and parenting, improve their operating efficiency. At present, the Meituan-Dianping algorithm team is actively exploring and researching the field of artificial intelligence, continuously innovating and practicing, and is committed to applying cutting-edge technology to bring a better life-service experience to consumers.