Introduction: Autonavi uses its own massive stream of imagery to ensure that every new POI in the real world is turned into map data in a timely manner. Over a short interval (less than a month), POIs at the same location change very little.
I. Background
POI stands for Point of Interest. On a digital map, a POI can be a restaurant, supermarket, government office, tourist attraction, transportation facility, and so on. POIs are the core data of an electronic map. For ordinary users, POI data carries name and location information, which satisfies the basic need of using a map to "find a destination" and then start navigation. For the map product, operations built on POIs such as "search nearby" and reviews increase users' active time. POI data is also the link between online and offline interaction and an important component of the Location Based Service (LBS) industry.
Autonavi uses its own massive stream of imagery to ensure that every new POI in the real world is turned into data in a timely manner. Over a short interval (less than a month), POIs at the same location change very little. As shown in the figure below, only the "Tang Fire Kung Fu" plaque is newly added.
Figure 1. Comparison of POI plaques at the same place at different times
Processing every POI would incur a high operating cost, so POIs that have not changed need to be filtered out automatically. The key technical capability here is image matching, which makes this a typical image retrieval task.
1. Technical Definition
Image retrieval is defined as follows: given a query image, search a large image gallery for similar images by analyzing visual content. This has long been a research topic in computer vision and has been studied extensively in person re-identification, face recognition, visual localization, and other tasks. The core technique of image retrieval is metric learning, whose goal is to pull samples of the same category together and push samples of different categories apart in a fixed-dimensional feature space. In the deep learning era there are several classical formulations, including Contrastive Loss, Triplet Loss, and Center Loss, which differ in how positive and negative samples are defined and how the loss function is designed. Another essential element of image retrieval is feature extraction, usually covering global features, local features, and auxiliary features, each optimized for the characteristics of the specific task. For example, person re-identification and face recognition have strong rigid constraints and obvious key structures (body/face keypoints), so body segmentation or keypoint detection information is often integrated into feature extraction.
2. Characteristics
POI plaque image retrieval differs considerably from mainstream academic retrieval tasks (such as person re-identification) in the following respects: heterogeneous data, severe occlusion, and text dependency.
Heterogeneous data
Person re-identification also deals with data from different sources, but there the differences mainly come from different cameras and different scenes. In the POI plaque retrieval scenario, the heterogeneity is much more severe, as shown in the figure below:
Figure 2. Heterogeneous images under different shooting conditions
The image on the left comes from a low-quality camera shooting head-on; the image on the right comes from a high-quality camera shooting from the side. Because of the differences in camera quality and viewing angle, the brightness, shape, and clarity of the POI plaques differ greatly, which makes plaque retrieval across such heterogeneous data very challenging.
Severe occlusion
Road scenes often contain interfering objects such as trees and vehicles, and because of the shooting angle, the captured POI plaques often suffer from severe occlusion, as shown in the figure below:
Figure 3. Examples of severely occluded POI plaques
Moreover, occlusion patterns are irregular, which makes it difficult to properly align the features of two plaques and poses a great challenge to POI plaque retrieval.
Text dependency
Another unique characteristic of POI plaques is their strong dependence on text, mainly the text of the POI name. In the scene below, the overall layout and color of the two plaques are very similar, but the POI name has changed. In this case we want the two plaques not to match, which requires introducing text features to increase feature discrimination. However, occlusion also makes text features unreliable, so they must be weighed together with image features. Moreover, text features and image features come from different modalities, so how to fuse multi-modal information is another technical difficulty unique to this business.
Figure 4. Example POI plaque with text changes only
II. Technical Scheme
The technical scheme for plaque retrieval consists of data iteration and model optimization. For data generation, we proceed in two steps: automatic data generation for cold start, and data generation from model iteration. For model optimization, we designed a multi-modal retrieval model with a visual branch and a text branch; since plaques carry rich text, visual and textual information are fused. For visual feature extraction, we further designed a global feature branch and a local feature branch and optimized each separately. The overall technical framework is shown in the figure below:
Figure 5. Overall technical scheme
First, the traditional SIFT matching algorithm is used to automatically generate the training data required by the model and complete the model's cold start. After the model goes online, the results of online manual operations are automatically mined and organized into training data for iterative optimization. The multi-modal retrieval model is built on a metric learning framework with Triplet Loss. Its inputs are: 1) the image of the POI plaque; 2) the text on the POI plaque. A dual-branch network extracts features from the image, BERT extracts features from the text, and the text and visual features are finally fused.
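To make this framework concrete, the sketch below outlines one possible way to organize such a model in PyTorch. It is a hypothetical skeleton, not the production design: the ResNet-50 backbone, layer sizes, number of stripes, and the fusion layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class PlaqueRetrievalModel(nn.Module):
    """Hypothetical skeleton: a shared visual backbone feeding a global branch
    and a local (stripe) branch, plus a text branch that projects a BERT
    feature, with global and text features fused for metric learning."""
    def __init__(self, text_dim=768, embed_dim=256, num_stripes=4):
        super().__init__()
        backbone = torchvision.models.resnet50()
        self.visual = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H, W)
        self.global_head = nn.Linear(2048, embed_dim)                 # global feature
        self.local_pool = nn.AdaptiveAvgPool2d((1, num_stripes))      # vertical stripes
        self.local_head = nn.Linear(2048, embed_dim)                  # per-stripe feature
        self.text_head = nn.Linear(text_dim, embed_dim)               # BERT feature -> embedding
        self.fuse = nn.Linear(embed_dim * 2, embed_dim)               # visual + text fusion

    def forward(self, image, text_feat):
        fmap = self.visual(image)
        g = self.global_head(fmap.mean(dim=(2, 3)))                   # plain average here; GeM later
        stripes = self.local_pool(fmap).squeeze(2).transpose(1, 2)    # (B, num_stripes, 2048)
        local = self.local_head(stripes)                              # local features for alignment
        fused = self.fuse(torch.cat([g, self.text_head(text_feat)], dim=1))
        return fused, local                                           # both trained with Triplet Loss
```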
1. Data
Training a retrieval model usually requires instance-level annotation, i.e., at the granularity of individual POI plaques. However, picking out the same POI plaque from different imagery is very tedious work; manual labeling would be expensive and could not scale. We therefore designed a simple and efficient pipeline for automatically generating training data, which supports cold-starting the model without any manual annotation.
We borrow from traditional feature point matching: the SIFT algorithm matches all plaques from two trips of data pairwise, and the matches are filtered by the number of inliers, i.e., matched plaques whose inlier count exceeds a threshold are regarded as the same plaque. Traditional feature point matching generalizes poorly, however, and the resulting training data can be hard for the model to learn from: 1) the training samples are relatively easy; 2) category conflicts, i.e., the same plaque is split into multiple categories; 3) category errors, i.e., different plaques are merged into the same category. We made corresponding optimizations: 1) using matching results from multiple trips to increase the diversity of plaques within the same category; 2) adopting a batch sampling strategy and MDR Loss [2] to reduce the model's sensitivity to mislabeled data.
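As an illustration of this cold-start matching step, the snippet below sketches SIFT matching with a geometric inlier filter using OpenCV. The inlier threshold, ratio-test value, and the plaque-crop inputs are assumptions, not the production settings.

```python
import cv2
import numpy as np

def same_plaque(crop_a, crop_b, inlier_thresh=15):
    """Decide whether two plaque crops show the same plaque by counting
    RANSAC inliers among SIFT matches (threshold is an assumed value)."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(crop_a, None)
    kp_b, des_b = sift.detectAndCompute(crop_b, None)
    if des_a is None or des_b is None:
        return False
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test to keep distinctive matches only
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 4:  # at least 4 correspondences are needed for a homography
        return False
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    inliers = int(mask.sum()) if mask is not None else 0
    return inliers >= inlier_thresh
```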
Specifically, for sample diversity, we generate training data from the matching results of multiple trips. The same plaque captured in different trips is shot from different viewpoints, which guarantees diversity within a category and prevents the automatically generated samples from being too easy. The batch sampling strategy samples data by category; because the total number of categories is much larger than the batch size, conflicting categories rarely appear in the same batch, which alleviates the category conflict problem (a small sampling sketch follows below). Building on Triplet Loss, MDR Loss introduces a metric learning framework with regularization constraints on different distance intervals, thereby reducing the model's overfitting to noisy samples.
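The sampling idea can be sketched as follows; this is a minimal, hypothetical helper with assumed class and instance counts, not the production sampler.

```python
import random
from collections import defaultdict

def sample_pk_batch(labels, num_classes=16, per_class=4):
    """Sample a batch of `num_classes` plaque categories with `per_class`
    instances each. Because the total number of categories is far larger
    than the batch size, conflicting categories of one physical plaque
    rarely co-occur within a batch."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = random.sample(list(by_class), num_classes)
    batch = []
    for c in chosen:
        pool = by_class[c]
        batch += random.sample(pool, per_class) if len(pool) >= per_class \
                 else random.choices(pool, k=per_class)
    return batch
```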
Figure 6. Schematic diagram of MDR Loss; compared with Triplet Loss, a distance regularization constraint is added
Figure 6 compares Triplet Loss and MDR Loss. MDR Loss does not want the distance between the anchor and a positive sample to be pulled arbitrarily close to zero, nor the distance to a negative sample to be pushed toward infinity. For noisy samples with category errors, different plaques are mistakenly grouped into the same category; the Triplet Loss objective then forces the model to pull their features arbitrarily close, so the model overfits the noisy samples and the final performance suffers.
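The actual MDR Loss [2] regularizes distances against multiple learnable levels; the snippet below is only a simplified, fixed-level stand-in that captures the intuition described above (margins, levels, and weights are illustrative assumptions).

```python
import torch.nn.functional as F

def regularized_triplet_loss(anchor, positive, negative,
                             margin=0.3, pos_level=0.5, neg_level=1.5, reg_weight=0.1):
    """Triplet Loss plus a simplified distance regularizer: keep positive
    distances from collapsing to zero and negative distances from growing
    without bound."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    triplet = F.relu(d_ap - d_an + margin).mean()
    reg = (d_ap - pos_level).abs().mean() + (d_an - neg_level).abs().mean()
    return triplet + reg_weight * reg
```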
2. Model
To improve plaque retrieval, we fuse the visual and textual information on the plaque and design a multi-modal retrieval model. For visual information, we optimize the model's ability to extract both global and local features. For textual information, BERT encodes the OCR results of the plaque as an auxiliary feature, which is fused with the visual features before metric learning.
Global features
Global features extracted by a deep model are generally robust for retrieval tasks and can adapt to changes in plaque viewpoint, color, and lighting. To further improve their robustness, we optimized two aspects: 1) an attention mechanism to emphasize important features; 2) an improved network backbone that preserves more fine-grained features.
In our business scenario, some plaques look similar overall but differ in details, as shown in Figure 8(c). In such cases we want the model to attend to fine-grained information on the plaque, such as the font, the text layout, or the text content itself. An attention mechanism helps the model focus precisely on the parts that distinguish different plaques among a large amount of information. We therefore introduce an attention module into the network so that the model learns the key information and the global feature becomes more discriminative. We adopted Spatial Group-wise Enhance (SGE) [4], which adjusts the importance of each spatial position by generating an attention factor for every position in the feature map. Figure 7 shows the SGE module: it first groups the feature maps along the channel dimension, computes a semantic feature vector for each group, takes a position-wise dot product between the semantic vector and the group's feature map to obtain an attention map, and then multiplies the attention map back onto the feature map position-wise to enhance the features and obtain a better spatial distribution of semantic features.
Figure 7. SGE schematic diagram, introducing a spatial attention mechanism
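A condensed PyTorch sketch of the SGE idea follows, implementing the grouping, position-wise dot product with the group semantic vector, normalization, and sigmoid gating described above; the group count and initialization are assumptions.

```python
import torch
import torch.nn as nn

class SpatialGroupEnhance(nn.Module):
    """Sketch of SGE [4]: per-group spatial attention from the dot product
    between each position's feature and the group's pooled semantic vector."""
    def __init__(self, groups=8):
        super().__init__()
        self.groups = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.weight = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.ones(1, groups, 1, 1))
        self.sig = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.size()
        x = x.view(b * self.groups, -1, h, w)       # split channels into groups
        attn = (x * self.avg_pool(x)).sum(dim=1, keepdim=True)  # dot with group semantic vector
        t = attn.view(b * self.groups, -1)
        t = (t - t.mean(dim=1, keepdim=True)) / (t.std(dim=1, keepdim=True) + 1e-5)
        t = t.view(b, self.groups, h, w) * self.weight + self.bias
        x = x * self.sig(t.view(b * self.groups, 1, h, w))      # gate features position-wise
        return x.view(b, c, h, w)
```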
To reduce the loss of local detail, we also improved the network backbone: the downsampling in the last block of the ResNet is removed so that the final feature map retains more local information. In addition, we replace the last global average pooling with a GeM [3] pooling layer. GeM is a learnable feature aggregation method of which global max pooling and global average pooling are special cases; using GeM pooling further improves the robustness of the global feature.
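A small sketch of GeM pooling and of removing the last-stage downsampling is shown below; the stride surgery assumes a torchvision ResNet-50 and the initial value of p is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GeM(nn.Module):
    """Generalized-mean pooling: p = 1 recovers average pooling, p -> inf
    approaches max pooling; p is learned from data."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        return F.avg_pool2d(x.clamp(min=self.eps).pow(self.p),
                            (x.size(-2), x.size(-1))).pow(1.0 / self.p)

resnet = torchvision.models.resnet50()
# Keep the last block at stride 1 so the final feature map stays larger.
resnet.layer4[0].conv2.stride = (1, 1)
resnet.layer4[0].downsample[0].stride = (1, 1)
resnet.avgpool = GeM()   # replace global average pooling with GeM
```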
Local features
Even with the optimized global features, the model still performs poorly in three situations: 1) for truncated plaques, the learned features are of low quality, as shown in Figure 8(a); 2) occluded plaques introduce irrelevant context into the features, as shown in Figure 8(b); 3) similar but different plaques are hard to distinguish, as shown in Figure 8(c). We therefore designed an additional local feature branch [1] so that the model pays more attention to local information such as plaque geometry and texture, and performs retrieval together with the global features.
Figure 8. Different examples requiring local feature optimization, (a) truncation, (b) occlusion, and (c) text changes
To extract local features, our main idea is to cut the plaque vertically into several parts, attend to the local features of each part [7], and optimize the local features after alignment. The alignment operation is shown in Figure 9 below: the feature map is first pooled vertically to obtain block-wise local features, the similarity matrix between the local features of the two images is computed, and then, following Formula 1, the alignment with the shortest total distance between the two images is found, where i and j index the i-th block of the first image and the j-th block of the second image, and Dij denotes the Euclidean distance between the features of block i and block j.
Formula 1. Local alignment calculation formula
Figure 9. Local alignment of POI plaque
Aligning local features in this way improves retrieval for plaques that are truncated, occluded, or have inaccurate detection boxes.
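A minimal sketch of this stripe alignment is given below, assuming the AlignedReID-style [1] shortest-path recursion of Formula 1, S[i][j] = min(S[i-1][j], S[i][j-1]) + D[i][j]; stripe extraction and distance normalization details are simplified.

```python
import torch

def aligned_local_distance(feat_a, feat_b):
    """feat_a: (m, d), feat_b: (n, d) stripe features obtained by vertically
    pooling the feature map. Returns the shortest-path alignment distance."""
    d = torch.cdist(feat_a, feat_b)          # D_ij: Euclidean distance between stripes
    m, n = d.shape
    s = [[None] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                s[i][j] = d[i, j]
            elif i == 0:
                s[i][j] = s[i][j - 1] + d[i, j]
            elif j == 0:
                s[i][j] = s[i - 1][j] + d[i, j]
            else:
                s[i][j] = torch.min(s[i - 1][j], s[i][j - 1]) + d[i, j]
    return s[-1][-1]                          # total distance along the best alignment path
```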
Text features
POI plaques depend strongly on text, and sometimes only the name text on a plaque changes. Although the global and local feature branches we designed can learn text features to some extent, text occupies a small proportion of the overall information and the supervision signal only says whether two images are similar, so text features are not learned well. We therefore take the existing OCR recognition results of the plaque, encode them with BERT to obtain text features, fuse them as an auxiliary feature branch with the visual features, and use the fused features in the final metric learning for plaque retrieval. Note that when extracting OCR results, to reduce the influence of recognition errors in any single frame, we use the OCR results of the same plaque across multiple frames within one trip, concatenate them, and insert separator tokens between results from different frames before encoding with BERT.
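For illustration, the snippet below encodes multi-frame OCR strings with a BERT model via HuggingFace transformers, joining frames with the separator token; the model name, the example strings, and the fusion step are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed Chinese BERT
bert = BertModel.from_pretrained("bert-base-chinese")

# Hypothetical OCR results of the same plaque from several frames of one trip.
ocr_per_frame = ["唐火功夫 川菜", "唐火功夫", "火功夫 川菜"]

# Join frames with the separator token so results from different frames stay distinguishable.
text = tokenizer.sep_token.join(ocr_per_frame)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    text_feat = bert(**inputs).pooler_output             # (1, 768) auxiliary text feature

# fused = torch.cat([visual_feat, text_feat], dim=1)     # fuse with visual features downstream
```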
3. Model Performance
With the new technical scheme, POI plaque image retrieval achieves very good results: both accuracy and recall exceed 95%, the online metrics improve greatly, and the model is also much faster. Some randomly selected matching results are shown in Figure 10.
Figure 10. POI plaque retrieval results randomly selected from the evaluation set
In the process of optimization, some very difficult cases were gradually solved, as shown in Figure 11 below:
Figure 11. Examples from the evaluation set: (a), (b), and (c) are incorrect retrieval results before optimization; (d), (e), and (f) are the retrieval results after optimization
Figures (a), (b), and (c) show bad cases before optimization (query image on the left, rank-1 retrieval result on the right). These cases make it clear that plaque retrieval demands very fine-grained feature extraction, because the plaques involved are similar overall but differ in local details. Such bad cases were the original motivation for our multi-modal retrieval model, and they were gradually resolved during optimization, as shown in (d), (e), and (f). With its optimized global features and aligned local features, the model pays more attention to the distinctive local characteristics of a plaque, such as the text, the font, and the plaque texture, so it distinguishes similar-looking but different plaques better, as the comparison between (a) and (d) shows. In addition, because of occlusion, different viewpoints, different lighting, and large color differences between cameras, some plaques are very hard to retrieve using visual features alone. We therefore add OCR information through the auxiliary feature branch to further enhance feature robustness, so that retrieval considers both the visual appearance of the plaque and the text on it, as the comparison between (b) and (e) shows.
III. Future Development and Challenges
Image retrieval is one of our attempts at automating Amap data production; it has achieved good results and is already used in the actual business. The model is not perfect, however, and corner cases remain. To address them, we will explore two directions: data mining based on semi-supervised learning/active learning to supplement data automatically, and introducing Transformers [9,10] to optimize feature extraction and fusion.
1. Data: Data mining based on semi-supervised learning/active learning
Data is critical: the model will never be perfect, there will always be corner cases, and the most effective remedy is to supplement targeted data. The key to supplementing data is how to mine corner cases and how to label them automatically, which is exactly the current research focus of semi-supervised learning and active learning. Semi-supervised learning uses a model trained on labeled data to generate pseudo labels for massive unlabeled data, then mixes the labeled and pseudo-labeled data to optimize the model. Active learning uses a model trained on labeled data to mine massive unlabeled data and sends the most valuable samples for manual annotation. The difference is whether manual annotation is involved: semi-supervised learning generates labels entirely by itself, which may cap the model's ceiling, while active learning can raise that ceiling to some extent. In the future we need to study how to combine the two to better supplement training data and resolve corner cases; a simple mining sketch follows below.
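A toy sketch of how the two strategies could split unlabeled retrieval results is shown here; the thresholds and the input scores are hypothetical.

```python
import numpy as np

def split_unlabeled(pair_scores, high=0.9, low=0.6):
    """pair_scores: retrieval similarity scores of unlabeled plaque pairs.
    Confident pairs become pseudo labels (semi-supervised learning);
    ambiguous pairs are sent for manual annotation (active learning)."""
    scores = np.asarray(pair_scores)
    pseudo_labeled = np.where(scores >= high)[0]
    to_annotate = np.where((scores >= low) & (scores < high))[0]
    return pseudo_labeled, to_annotate
```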
2. Model: Feature extraction and fusion based on Transformer
Transformers are currently a hot research topic, and much work has demonstrated their effectiveness in classification, detection, segmentation, tracking, and person re-identification. Compared with CNNs, Transformers have a global receptive field and model higher-order correlations, which gives them stronger representation power for feature extraction. Their inputs are also more flexible: other modalities can easily be encoded and fed into the model together with image features, which is a great advantage for multi-modal feature fusion. In short, Transformers can improve the matching of POI plaques in occlusion/truncation scenes through correlation modeling of image patches, and can fuse multi-modal features by encoding text features.
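As a rough illustration of this direction (not an implemented design), a Transformer encoder could consume image patch tokens and text tokens jointly; the module name, dimensions, and depth below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class PatchTextFusion(nn.Module):
    """Toy fusion module: image patch tokens and OCR text tokens are
    concatenated with a CLS token and mixed by a Transformer encoder."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patch_tokens, text_tokens):
        # patch_tokens: (B, Np, dim); text_tokens: (B, Nt, dim)
        b = patch_tokens.size(0)
        tokens = torch.cat([self.cls.expand(b, -1, -1), patch_tokens, text_tokens], dim=1)
        return self.encoder(tokens)[:, 0]     # fused embedding taken from the CLS position
```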
References:
[1] Zhang X, Luo H, Fan X, et al. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
[2] Kim Y, Park W. Multi-level distance regularization for deep metric learning. arXiv preprint arXiv:2102.04223, 2021.
[3] Radenović F, Tolias G, Chum O. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(7): 1655-1668.
[4] Li X, Hu X, Yang J. Spatial group-wise enhance: Improving semantic feature learning in convolutional networks. arXiv preprint arXiv:1905.09646, 2019.
This article is the original content of Aliyun and shall not be reproduced without permission.