Preface: It is spring recruiting season again! As a national travel service platform, Autonavi's business is growing rapidly, and a large number of campus and experienced-hire positions are open. You are welcome to submit your resume; details are at the end of the article. To help you learn more about Autonavi's technology, we have planned the **#Spring Recruitment Column** article series, in which senior engineers from each business team give introductions that combine business background with technology practice.
This article is the third in the #Spring Recruitment Column series. It is based on "The Practice of Visual Technology in Automatic POI Name Generation," a talk given by Hao Zhihui, head of the basic R&D department of the Autonavi Visual Technology Center, at the AT technical forum.
**AT Technology Tribune (Amap Technology Tribune)** is a technical exchange event initiated by Autonavi. Each session focuses on one theme, and experts from inside and outside Alibaba Group are invited to exchange ideas through talks, Q&A, and open discussion.
Hao Zhihui's team works on many computer vision technologies, including object detection, recognition, segmentation, geometric reconstruction, and visual localization.
Collection of Autonavi POI data
Autonavi has more than 70 million POI (Point of Interest) records. Every year many new POIs appear, while others close down or move. **How are these POIs produced and updated?** In terms of collection methods, there are many ways to obtain POIs. One important and intuitive method: Autonavi collects images of street-side shops through crowdsourcing and extracts POI data from those images with computer vision technology (plus manual assistance).
The following figure illustrates a crowdsourced collection process: Autonavi collectors walk along the street taking continuous images, then upload the images together with their GPS coordinates to Autonavi.
The diagram below shows a POI's journey from collection through production to use. The input is a sequence of collected images. In production, the key step is to compute each POI's content and location, then match it against the master library to decide whether the POI already exists or needs to be added. Computer vision is needed both to recognize the POI's name from the image and to compute its coordinates.
This article focuses on the name part. In practice, Autonavi's POI production is not fully automated but a human-machine combination: when the machine cannot handle a case automatically, or its confidence is low, the case is routed to manual operators.
Autonavi’s POI data collection – production – use process diagram
The rich variety of POI signage makes automated processing challenging: recognizing the text, deciding whether it belongs to a POI, determining the relationships between text lines, and composing the final name.
The following picture is an example. Going from the raw image to an automatically generated POI name involves several key computer vision technologies: natural scene text recognition, text attribute judgment and structuring, and automatic name generation.
Natural scene text recognition
Text recognition, simply put, is finding the text in a picture and outputting the correct characters. Viewed through its historical development, the problem contains several distinct sub-problems.
First, there is the familiar term OCR, short for Optical Character Recognition. The original setup used an optical scanner to read printed text into binary data, then recognized the characters and output them as ASCII.
OCR has a long history; the 1980s and 1990s saw many research papers and commercial products. For example, Yann LeCun, one of the pioneers of deep learning, used neural networks in the early 1990s to recognize handwritten zip codes, work that was commercialized at American banks.
As text recognition technology developed, its scope widened. Beyond print and handwriting, can text in any ordinary photograph be recognized?
In the middle column below, the problem is called born-digital: the text is computer-generated, so its font and layout are relatively fixed.
The third column is text recognition in natural scenes, called STR (Scene Text Recognition): real-world text such as shop names and road signs. This variant is the hardest because of shooting angles, lighting, and image quality, and it is also the most actively studied in academia.
Of course, today’s STR technology faces many challenges, including font problems, typography problems, multilingual problems, and lighting and blur problems caused by photography.
The words on a shop's signboard are more complex than text in other scenes: the shop wants to express its character and stick in your memory, so artistic fonts and varied decorative effects are common.
In addition, Amap maintains POI data for the whole country; the place names, shop names, and brand names across different cities form a very large vocabulary by themselves.
STR Technology Development: Traditional Algorithms (before 2012)
Let’s start with a brief introduction to STR technology.
The development of natural scene text recognition (STR) can be roughly divided into two stages, with 2012 as the watershed; after 2012 it entered the deep learning stage.
Before 2012, mainstream text recognition algorithms relied on traditional image processing and statistical machine learning. The pipeline had two parts: text line detection and text recognition.
Text line detection generally starts with preprocessing, using binarization, connected-component analysis, the MSER (Maximally Stable Extremal Regions) saliency operator, and similar algorithms to locate text regions and extract text line candidates, then removes invalid candidates with a classifier.
Text recognition generally proceeds by segmentation: cut out character/word candidates, then classify each one with a machine learning classifier.
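The front end of this classic pipeline can be sketched in a few lines. The following is a minimal, illustrative version only (fixed global threshold, 4-connectivity, a made-up minimum-area filter), not Autonavi's actual implementation:

```python
from collections import deque

def binarize(image, threshold=128):
    """Binarize a grayscale image (list of rows): 1 = dark/text, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in image]

def connected_components(binary):
    """Label 4-connected foreground regions via BFS; returns a list of pixel sets,
    each set being one candidate character/word region."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not seen[r][c]:
                queue, comp = deque([(r, c)]), set()
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

def filter_candidates(components, min_area=2):
    """Drop tiny components that are unlikely to be characters (a crude stand-in
    for the classifier that removes invalid candidates)."""
    return [c for c in components if len(c) >= min_area]
```

In a real system the candidate filter is a trained classifier over shape and texture features, not an area rule; the sketch only shows where it sits in the pipeline.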
Traditional text recognition methods can achieve good results in simple scenes, but the parameters of each module must be designed independently for each scene. In complex scenarios it is difficult to tune the parameters into a model with good generalization.
STR Technology Development: Deep Learning Algorithms (After 2012)
Since about 2012, STR, like other computer vision problems, has entered a deep learning phase.
The two sub-problems mentioned above, text line detection and text recognition, are each addressed by deep learning models. Here are a few representative works.
On the far left is a text line detection model, TextBoxes++ from Huazhong University of Science and Technology. Built on an SSD-like network structure, it directly regresses the coordinates of the four vertices of a quadrilateral, handling problems such as extreme aspect ratios and rotation.
In the middle is a sequence recognition model, which represents a genuinely new way of framing the problem in the deep learning era. Given an image of a character sequence, the traditional scheme cut it into single characters or words and classified each; with an RNN such as LSTM, the feature sequence can be encoded with its left and right context, and introducing CTC loss allows a complete sequence recognition model to be trained end to end.
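The decoding rule that pairs with CTC training is easy to illustrate: at inference time, the per-frame predictions are collapsed by merging consecutive repeats and then dropping the blank label (a blank between two identical labels keeps them as two separate characters). A minimal greedy (best-path) decoder:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Standard CTC best-path decoding for per-frame argmax outputs:
    merge consecutive repeats, then drop the blank label."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Example: frames [1, 1, 0, 1, 2, 2, 0, 0, 3] with blank=0
# collapse to [1, 1, 2, 3] — the blank keeps the repeated 1s distinct.
```

Real systems often replace this greedy rule with beam search plus a language model, but the collapse-and-drop rule is the core of CTC decoding.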
Beyond converting the text line detection and recognition stages to deep learning separately, there are also efforts to merge the two into an end-to-end scheme. Why merge? It is not hard to imagine that if you can recognize the text content, you should in theory be able to detect it more accurately. For example, if the first characters of "deep learning" have been recognized, could some feedback signal in the network tell the detector where the remaining characters should be?
The third column shows an end-to-end work: it connects Faster R-CNN and an LSTM in the same network, performing classification and coordinate regression for each proposal as well as character recognition.
Autonavi’s STR technology
Autonavi's STR technology is likewise divided into two parts: text line detection and text recognition.
In practice, the end-to-end model is not used, because a modular pipeline is easier to optimize locally, for example by adding training samples for one module or swapping out one model.
In the text recognition stage, Autonavi runs two schemes in parallel: the upper branch detects and recognizes single characters, and the lower branch recognizes entire text sequences.
Autonavi STR technology – text line detection
First, text line detection. In the early days, before 2017, semantic segmentation models such as FCN and DeepLab were used to segment text lines.
After Mask R-CNN appeared in 2017, instance segmentation technology matured, and we found that it outperformed semantic segmentation models for text line detection. Most importantly, because instance segmentation identifies each text line individually, it naturally separates adjacent lines. Semantic segmentation, by contrast, requires a lot of post-processing to distinguish different lines of text.
Of course, in addition to Mask R-CNN, we will also use other instance segmentation models.
On practical business data, Autonavi's text line detection performs well: whether the text lines are dense or blurred, detection accuracy remains high.
Autonavi STR technology — text recognition
For text recognition, Autonavi actually uses two branches: single-character detection and recognition, and sequence recognition. The final result is a fusion of the outputs of the two branches.
Why have two branches?
As this example shows, the character "一" in "一二三四" (the numerals one through four) is hard to detect precisely as a single character, because its single horizontal stroke is easily confused with the background. But because it sits in the middle of a text line, it can still be recognized by the whole-sequence branch.
Is it ok to just rely on sequence recognition and get rid of single-character branches? Or what could go wrong? I’ll leave that to the students to think about.
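As food for thought on how the two branches might be combined, here is a purely hypothetical fusion sketch (the article does not describe Autonavi's actual fusion rule): trust the sequence branch by default, and fall back to the single-character string only when it disagrees and is more confident:

```python
def fuse_results(seq_text, seq_conf, char_results):
    """Hypothetical late fusion of the two recognition branches.
    seq_text/seq_conf: output and confidence of the sequence branch.
    char_results: list of (char, confidence) from the single-character branch."""
    char_text = "".join(ch for ch, _ in char_results)
    # Use the weakest per-character confidence as the branch confidence.
    char_conf = min((p for _, p in char_results), default=0.0)
    if char_text == seq_text:
        return seq_text  # branches agree; either answer works
    return seq_text if seq_conf >= char_conf else char_text
```

A real fusion module would likely align the two outputs character by character and arbitrate per position rather than per string; this sketch only shows why keeping both branches gives the fuser something to arbitrate.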
Sequence recognition model
Early sequence recognition models mainly used LSTM + CTC loss, later replaced by LSTM with an attention layer. Attention lets the network focus on the relevant feature inputs at each output timestep, yielding better predictions.
With these methods, recognition works well across different fonts, different orientations, and even different languages.
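The attention idea can be sketched minimally: score each encoder timestep against the decoder's current query, softmax the scores into weights, and take a weighted sum of encoder states as the "focused" context for that output step. This is a bare dot-product simplification of the learned attention layer in the real model:

```python
import math

def attention_weights(query, encoder_states):
    """Dot-product attention: score each encoder timestep against the query,
    then softmax the scores into weights that sum to 1."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of encoder states: the focused feature for this output step."""
    dim = len(encoder_states[0])
    return [sum(w * state[i] for w, state in zip(weights, encoder_states))
            for i in range(dim)]
```

In practice the scores come from a small learned network and the query is the decoder's hidden state, but the softmax-and-weighted-sum core is the same.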
Mining and generation of Hard case
In practical work, in addition to model design and optimization, there are also many other problems.
A big problem is the sheer number of Chinese characters. Roughly 3,000 to 5,000 are in common use, but the characters appearing in POIs go far beyond that. Amap has 70 million POIs, so you can imagine how large that character set becomes.
We have several solutions to this problem. For example, you can find the characters of interest in POI names, locate the collected pictures containing them, and send those for manual labeling. You can also use computer font libraries, plus rendering effects, to synthesize training samples.
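The mining step of the first solution can be sketched as a simple frequency pass over POI names: collect every character outside the common-use set, then prioritize the most frequent ones for labeling (the names and common-character set below are illustrative):

```python
from collections import Counter

def mine_rare_characters(poi_names, common_chars, top_k=10):
    """Find characters that appear in POI names but fall outside the
    common-use set: candidates for targeted labeling or sample synthesis.
    Returns the top_k such characters by frequency."""
    counts = Counter(
        ch
        for name in poi_names
        for ch in name
        if ch not in common_chars and not ch.isspace()
    )
    return [ch for ch, _ in counts.most_common(top_k)]
```

The frequency ordering matters in practice: labeling effort goes furthest on rare characters that nonetheless recur across many POIs.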
Autonavi has been developing its text recognition technology since around 2016 and is still optimizing it. The technical team has also entered several competitions to benchmark itself. The major competition in the OCR field is ICDAR; Autonavi competed in the text line localization and character recognition tracks in 2017 and 2019 with good results.
Text attribute determination and structured processing
Once the text in a scene is detected and recognized, you need to determine which text belongs to the POI name. This requires judging the attributes of each text line. At the same time, adjacent lines of text are often related, and their relationships must be computed for structured output.
Text attribute determination problem
This problem is challenging: whether a line of text is a POI name depends on both its content and its position. The picture above shows an example. Just reading "member Internet access, 2 yuan/hour," you can basically guess it is not a POI name. Likewise with "Best Express": above a shop on the signboard it is most likely a POI name, but on the side of a delivery truck it is not the name we want to produce.
The most direct text attribute task is noise reduction: excluding obviously invalid POI text. Autonavi uses a two-channel convolutional neural network over the image and the text, which achieves clearly noticeable noise reduction.
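A heavily simplified, hypothetical sketch of the two-channel idea: a keyword stand-in for the text channel plus an image-channel score, fused late into a noise probability. The real model learns both channels end to end; the keyword list and weights here are purely illustrative:

```python
import math

# Illustrative noise cues for the text channel (hypothetical, not Autonavi's).
NOISE_HINTS = ("yuan/hour", "tel", "hotline", "discount")

def text_channel_logit(text):
    """Stand-in for the learned text channel: obvious price/contact phrases
    push the logit toward 'noise'."""
    return 2.0 if any(h in text.lower() for h in NOISE_HINTS) else -1.0

def fuse_channels(image_logit, text_logit, w_image=0.5, w_text=0.5):
    """Late fusion of the two channels; sigmoid gives P(line is noise)."""
    z = w_image * image_logit + w_text * text_logit
    return 1 / (1 + math.exp(-z))
```

The point of the two channels is that each catches what the other misses: "Best Express" is clean text but noisy when the image shows a truck, while "2 yuan/hour" is noise regardless of where it appears.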
Once text lines can be classified as POI name versus noise, the task can be extended to categorizing POI-related text into multiple attribute classes, including main name, branch name, business scope, contact information, and so on. In POI name production, human operators follow process specifications to select part of the text; the machine likewise uses these attributes to automatically select or discard lines and order them, finally generating the POI name.
In addition, Autonavi introduced semantic segmentation of signboards to determine each signboard's independent boundary. Within a boundary, the main name is unique.
Automatic name generation
Finally, take a look at the automatic name generation problem and the solution.
How do you automatically generate POI names after text recognition and attribute determination? Once familiar with the conventions, a person can infer a shop's correct name from its signboard (arguably the basic function of a well-designed signboard). So can a machine learn those naming rules and generate the name from the signboard?
In the real world, this is not a trivial problem. For example, what is the correct POI name for the signboard below?
Here are some more examples:
Name automatic generation model
As above, the input is multiple lines of text, and the output is a label for each line (whether it is selected as part of the final name) plus an ordering. If you ignore the image information, this is an NLP problem, and a BERT model can be trained for it. The problem was defined as a two-task learning problem: a classification task and a regression task.
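Turning the two task outputs into a final name can be sketched as follows: keep the lines whose selection probability passes a threshold (classification task), then sort the kept lines by their predicted order score (regression task). The threshold and example inputs are illustrative, not from the article:

```python
def assemble_name(lines, selected_probs, order_scores, threshold=0.5):
    """Combine the two task outputs into a POI name.
    lines: recognized text lines.
    selected_probs: per-line probability of belonging to the name (classification).
    order_scores: per-line position score, lower = earlier (regression)."""
    kept = [(order_scores[i], lines[i])
            for i, p in enumerate(selected_probs) if p >= threshold]
    return "".join(text for _, text in sorted(kept))
```

Splitting selection and ordering into separate heads lets the model reject noise lines outright while still learning that, say, a branch name follows the main name on the signboard.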
Image information is then added to the model: the bounding boxes of all text lines are encoded into features with a graph attention network and concatenated with the BERT features, improving the model's capacity.
Further, inspired by work at Microsoft, Autonavi also adopted the VL-BERT model. Name generation quality was eventually raised to 95%.
Here are some automatically generated names. In the first three examples, the model learned the naming rules fairly well despite differences in layout.
Of course, as the bad case shown in the figure demonstrates, the model's prediction can go wrong when the signboard format is uncommon. This is a direction for later optimization.
About Autonavi Vision Team
The Amap vision algorithm core team is composed of outstanding scientists and engineers located in Seattle, Silicon Valley, Beijing, and elsewhere, solving hard problems and exploring innovative technologies for a new future of mapping, navigation, and mobility. Its work covers image understanding, video analysis, multi-source fusion, and other technologies, serving map and HD-map production, localization, traffic and prediction, AR navigation, assisted driving, infotainment, and other domains. It is the core engine of Amap's advanced technology development.