Takeaway

POI stands for "Point of Interest." On a map, a POI can be a house, a shop, a bus stop, a lake, a road, and so on. In the map search scenario, the POI is the retrieval object, equivalent to a web page in web search. On the map client, when a user selects a POI, a floating balloon points to it.

As shown on the left, the Watsons in this mall is a POI. A category label is a generalization of POI attributes along the category dimension: for example, the category label of Watsons is Cosmetics, while that of the Cade Mall where this Watsons is located is Shopping Mall. On the right is the list of POIs recalled for a store query; all of them have category attributes matching the query.

The figure above also shows the two main usage scenarios for category tags. On the one hand, they let the front end show richer information to users and support decision making; on the other hand, they support category-search requirements. In the map scenario, both queries and POIs are highly polysemous, so a traditional text-matching engine or simple synonym generalization can hardly achieve this goal; we therefore mine labels as a basis for recall and ranking.

Our category system is mainly built on the following considerations:

  • Users’ actual query expressions, since the primary purpose is to support users’ search requirements;

  • The objective distribution of categories in the real world, and the PM’s perception of this distribution;

  • Subordinate and parallel relationships between different labels.

Finally, each category is built into a multi-level, multi-branch tree, such as this classification of the shopping category:

Difficulties in the construction of category labels

Our goal is tagging, that is, mapping each POI onto the nodes of the category tree above. This is obviously a classification problem, but not a pure one:

  • Multi-label problem: Watsons maps to Cosmetics, a one-to-one mapping; but some POIs carry several labels at once. Tangquan Liangzi, for instance, covers bathing, massage, and foot treatment; an XX furniture store that hits the Furniture Store label must also hit its parent-node label, Home Building Materials. Overall, this is a multi-label problem, not a multi-class one;

  • Text-related issues: most POIs have intuitive text titles, such as Niu Niu Electric Bikes, Haier Exclusive Store, Dongying Ming Tea, Xiyan Jingyi, and Newborn Noble, where text analysis of the name alone predicts the correct result. On the other hand, it is not a pure text problem: for an "Apple exclusive store," the text alone cannot tell whether it sells phones or fruit; and some expressions, such as "Old Five Wholesale," are low-frequency or carry no category information at all, so other features must be introduced to handle them;

  • Comprehensive problem: the algorithm can solve the main problem, but the real world is too complex for a simple algorithm to cover completely. Distinguishing bars from nightclubs, identifying Grade-3A hospitals, tagging auto 4S dealerships, handling low-recognition brands, and so on cannot be done competently with limited samples and features, yet cannot be ignored.

In addition, the application side has high requirements for label precision and coverage: low tagging precision may cause wrong POIs to be recalled when users search, and low coverage may omit the results users expect. There are 20+ top-level categories to build, each with dozens of sub-labels, for thousands of labels in total. We need a fast, efficient method that guarantees both precision and recall so that the labels can land effectively and deliver gains.

To sum up, the category tagging problem we need to solve is a multi-label classification problem that relies mainly on text for identification, but other non-text features and means must be introduced to solve it satisfactorily.

Technical solution

Overall scheme design

As shown in the figure, in order to complete tagging efficiently, we designed the following main process modules:

  • Feature engineering: text features solve the most important part of tagging, but POI texts in the map scenario are short, widely distributed, and long-tailed, with many low-frequency texts and low-frequency brands that carry no category information. Text descriptions such as reviews and introductions tend to exist only for high-frequency POIs, while the difficulty lies precisely in the low-frequency ones. Feature design therefore uses general features as far as possible: POI name, typecode (a separate classification system maintained by the data producer), source category (the original category from the data provider), brand, and so on; proprietary features or data available only for high-frequency POIs are generally not used in the general model (a schematic feature record is sketched after this list).

  • Sample engineering: sample mining and cleaning, like the model design, aim at solving the general tagging problem. The diversity of POI expressions and the large number of labels mean that a very large number of samples is required; given the high cost of labeling, sample sources other than manual annotation must be considered.

  • Classification model: a plain text classification model cannot handle non-text features, and a plain multi-class model cannot handle the multi-label problem. We made a series of model modifications to fit the business;

  • Multi-channel fusion: the classification model solves the main problem, but not the whole problem. In the map scenario there are always issues the algorithm cannot solve, such as 5A scenic spots, Grade-3A hospitals, and all kinds of brands. We therefore designed multi-channel tagging: a brand library as a safety net for brand cases, external resources introduced to solve non-algorithmic problems in batches, special-purpose mining for categories that general tagging cannot cover, and so on. In general, problems outside the model are tagged with the help of external knowledge, and the overall problem converges.
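For concreteness, here is a minimal sketch of what a general-feature record might look like; the field names and example values are hypothetical, not the production schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PoiFeatures:
    # General features usable across all categories; names are illustrative.
    name: str                              # POI title text, the primary signal
    typecode: Optional[str] = None         # producer-maintained classification code
    source_category: Optional[str] = None  # original category from the data provider
    brand: Optional[str] = None            # normalized brand, if recognized

example = PoiFeatures(name="Watsons", typecode="061205", brand="Watsons")
```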

The following sections introduce the main business difficulties and our main work on samples and models.

Sample engineering

Sample sources & cleaning

On the sample side, experiments showed that manual labeling alone could hardly meet the requirements, given the large number of labels and the large sample size each label needs. We therefore relied mainly on click logs and some external resources:

  • Click logs are large in volume, can be generated periodically, and reflect users’ most direct intent; the drawback is that they are noisy, and users tend to click on high-frequency expressions, so low-frequency expressions are scarce.

  • External resources are small in volume but diverse, which compensates for the click data’s weak coverage of low-frequency expressions.

By introducing samples from these two sources, we quickly obtained millions of raw samples. Cleaning a sample set this large is still a huge workload, so we designed a two-level process combined with active learning:

After the initial samples from the two sources are introduced:

  • First, the data is sampled and cleaned, and the cleaning process is abstracted into business rules for global cleaning and iteration;

  • Once the overall system and the obvious noise begin to converge, we split the remaining data, take one part to build a model each time, use it to identify the other part, and then manually annotate the part that is poorly identified, repeating the process.

In this way, similar to active learning, the value of manual labeling is maximized, avoiding the wasted effort of repeatedly labeling redundant, low-information samples. A minimal code sketch of one cleaning round appears below; after that, we detail how click samples are mined.
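The sketch below shows one cleaning round under this scheme, assuming scikit-learn and a character n-gram text model; the model choice, confidence criterion, and batch size are placeholders, not the production setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_cleaning_round(labeled_texts, labeled_y, pool_texts, budget=200):
    """One round: fit on the cleaned labeled part, score the remaining pool,
    and return indices of the least-confident samples for manual annotation."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
    X = vec.fit_transform(labeled_texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labeled_y)

    proba = clf.predict_proba(vec.transform(pool_texts))
    confidence = proba.max(axis=1)          # low confidence = poorly identified
    return np.argsort(confidence)[:budget]  # send these for manual labeling
```

Samples annotated in each round are merged back into the labeled set, so annotation effort keeps concentrating on the most informative, least redundant samples.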

Click sample mining

Search click logs condense the needs and wisdom of countless users, and most search businesses can mine their most primitive training samples from them. For the present mining task, the first problem to solve is that the sample expressions do not line up:

```
Click data:     Query -> POI
Needed sample:  Tag   -> POI
Solution:       Tag   -> Query -> POI
```

As shown in the figure below, to mine a sample set for Underwear, a seed query set mapped to the label is defined manually; the click samples recalled by this query set can be used directly as samples for the Underwear label.

In practice, we added a mapping from the seed queries to a generalized set: the manually defined high-frequency query set is generalized into a larger synonym set, and click samples are then recalled from that synonym set. The reasoning is as follows:

High-frequency queries mainly recall high-frequency samples, whereas the real difficulty is mining low-frequency expressions. We therefore generalize queries from high frequency to low frequency so that low-frequency queries can recall low-frequency sample expressions, for example expanding "silk stockings" to "casual cotton socks," or "underwear" to brand names such as Victoria’s Secret and Urban Beauty.

Query generalization procedure:

Query generalization must obtain near-synonymous low-frequency expressions from the high-frequency set while ensuring that semantic diffusion does not drift the generalized set away from the label’s original meaning. We mainly tried the following schemes:

  • Word2vec: learn word-granularity vector representations from click or external corpora, max-pool the word vectors within a query to obtain a query vector, and then search the full query set for queries whose vectors are close. The main problem is that generalization via word-granularity embeddings is uncontrolled, easily introduces noise that is hard to clean, and clearly misses cases.

  • Synonym dictionaries: recall is very limited, and the resulting queries do not necessarily match users’ natural expressions.

  • Session context: sessions in the map scenario are generally short, so recall is limited, semantics drift, and precision and recall are both low.

  • Recommendation-based methods: staying with click-log mining, treat each query as a user and each clicked POI as an item, and borrow ideas from recommendation for similar-query mining — queries that click the same or similar POIs are somewhat similar, and the query–POI click relation naturally forms a network. We referred to two graph-based recommendation methods:

  1. SimRank: a bipartite graph is built from the click relations between queries and POIs, and the similarity of two nodes is the weighted average of the similarities of all the nodes they connect to. After repeated iterations the similarities converge, yielding query-to-query similarity.

Let a and b be two nodes in the graph, with self-similarity defined as 1, i.e., s(a, a) = 1, and let I(a) and I(b) be the sets of nodes connected to a and b. For a ≠ b, their similarity is defined as follows:
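With these definitions, the standard SimRank recurrence, where C is a decay factor between 0 and 1, reads:

$$
s(a,b) \;=\; \frac{C}{|I(a)|\,|I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} s(i,j), \qquad s(a,a) = 1 .
$$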

  2. DeepWalk: a method for learning representations of nodes in a network that applies language modeling to the network structure, so that deep learning can capture the relationships between nodes. As with SimRank, a click network is built from query–POI click relations; random walks over this network produce walk fragments, from which query vector representations are learned, and similar queries are then recalled by vector similarity.
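Below is a minimal sketch of this pipeline on a toy click graph, with gensim’s Word2Vec as the skip-gram learner over walk fragments; the graph contents and hyperparameters are illustrative only.

```python
import random
from gensim.models import Word2Vec

def random_walks(graph, num_walks=10, walk_len=8):
    """graph: dict node -> list of neighbors (query and POI nodes mixed,
    built from query->POI click relations). Returns walk fragments."""
    walks = []
    for _ in range(num_walks):
        for node in graph:
            walk = [node]
            for _ in range(walk_len - 1):
                walk.append(random.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks

# Toy click network: queries and POIs as nodes, clicks as edges (both directions).
click_graph = {
    "q:underwear": ["p:store_1", "p:store_2"],
    "q:victoria's secret": ["p:store_2"],
    "p:store_1": ["q:underwear"],
    "p:store_2": ["q:underwear", "q:victoria's secret"],
}

walks = random_walks(click_graph)
# Each whole query/POI string is one "word", so the learned vectors are at
# query granularity rather than segmented-word granularity.
model = Word2Vec(sentences=walks, vector_size=64, window=3, min_count=1, sg=1)
similar = model.wv.most_similar("q:underwear", topn=5)
```

Because each walk token is a whole query (or POI) string, the learned vectors are query-granularity, matching the sentence-level representation discussed below.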

Of the two, SimRank is computationally expensive: even experiments on a small dataset took a long time to iterate, and on tens of millions of click records the computing resources and time required become enormous and uneconomical. The DeepWalk random-walk method, by contrast, achieved good results in practice:

For example, from the original query "paint," generalization yields subcategories, brands, and so on, including some regional naming expressions.

Unlike learning word-granularity embeddings after segmenting a query, DeepWalk learns a representation of the whole query, i.e., at sentence granularity, avoiding the semantic drift of pooling multiple word vectors after segmentation. Representation learning at query granularity also makes the mining results more consistent with actual online expressions, which makes it convenient to recall clicked POI samples later. Meanwhile, the coarser granularity better preserves the relational structure among network nodes.

Overall, the query generalization step significantly increased the proportion of low-frequency expressions in the sample set.

Model design and iteration

Using classification to solve tagging is common practice in similar scenarios. From the business perspective, the data we need the model to tag falls mainly into four categories:

  • High-frequency suffix recognition

  • Low-frequency suffix recognition

  • Brand recognition

  • Non-text recognition (typecode, source category, etc.)

For the first three, a pure text classification model can work, but a text model cannot directly incorporate non-text features. In addition, the early-stage sample data was noisy and unevenly distributed, which also had to be taken into account.

Hierarchical multi-class classification

In the early stage, multi-class models were used: a multi-class classifier was built for each non-leaf node, starting from the root, with leaf nodes as the final-level outputs. This is intuitive and matches the label hierarchy well.

But the method has obvious defects: the structure is complex and unstable, hard to maintain, and sample organization is cumbersome; errors and omissions in an upper-level classifier propagate down and degrade every level below it.

One binary classifier per label

In the early stage of the business, the pre-optimization search effect was limited, so the sample set contained a certain amount of noise; meanwhile, the system would be constantly iterated and optimized during the investigation period. We therefore needed a model that is highly interpretable, tolerant of change, and directly supports multiple labels.

We therefore trained a binary classifier for each label. A POI to be identified is run through all binary classifiers, whose predictions are then combined before the downstream business logic is applied. This handles the multi-label scenario well and avoids conflicts among the multiple labels of one POI.

For samples, the OVR (one-vs-rest) scheme is used, and sample proportions can be adjusted (positive/negative ratio, sources of negatives, etc.), which accommodates the imbalance of the business samples. A minimal sketch follows.
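The sketch below shows the per-label binary setup using scikit-learn’s one-vs-rest wrapper over toy data; in production, each label’s binary model and its sample ratios were managed separately rather than through a single wrapper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

names = ["Tangquan Liangzi Bath & Foot Massage", "XX Furniture Store", "Watsons"]
labels = [["bath", "massage"],
          ["furniture_store", "home_building_materials"],
          ["cosmetics"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)   # one binary target column per label
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)).fit_transform(names)

# One independent binary LR per label; predictions are combined afterwards.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = mlb.inverse_transform(ovr.predict(X))
```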

A unified model

The per-label binary classification model is interpretable and flexible, and modeling each label separately matched the early business requirements well, letting label effects land quickly and online bad cases converge. Compared with the original manual expert rules, recall improved significantly while efficiency rose more than tenfold.

In the early stage, to move the business forward quickly, the simple binary classification model solved the main problems. As the search effect kept improving, the business demanded higher quality, and the LR binary models still had problems:

  • Bag-of-words features inflate the feature dimension

  • Feature selection discards low-frequency features

  • Generalization is mediocre and requires many hand-built business rules

  • Independent modeling fails to capture the relationships between labels (parent–child, co-occurrence, mutual exclusion)

  • Large numbers of negative samples are down-sampled and wasted

  • The number of models is very large, so maintenance cost is high

To do better, an upgrade to a deep model was needed. Common deep text classification models include CNNs, RNNs, and various attention-based models. Attention-based models mainly target deep semantic understanding of long texts, whereas our business scenario needs generalization over the simple semantics of short texts. Considering also the need to combine non-text features and the demands of classification efficiency, we chose the textCNN model.

TextCNN is a pure-text multi-class model: it cannot handle the multi-label case and is incompatible with non-text features; moreover, sample imbalance leads to under-learning of small-sample categories. To match the business, we modified the original textCNN as shown in the figure:

The original textCNN applies word embedding, convolution, and pooling to the text, then feeds the result directly into a softmax multi-class output. Under the current business:

  • After the text representation is pooled, external non-text features are concatenated with it: deep feature extraction is used for the text, while the simple non-text features join the prediction in a Wide&Deep-like manner;

  • Softmax can only output a normalized prediction, which is unsuitable for multi-label scenarios; the output layer is changed to multi-channel sigmoid outputs, where each channel corresponds to the prediction of one label;

  • Samples carrying several labels are compressed so that the multiple labels are merged into one target vector;

  • The loss function is redesigned, since multi-class cross-entropy does not apply to the unnormalized outputs of multiple binary channels.

This solves both the multi-label scenario and the integration of non-text features, while the deep model achieves better generalization. A minimal sketch of these modifications follows.
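Below is a minimal PyTorch sketch of the four modifications, with illustrative dimensions rather than the production configuration:

```python
import torch
import torch.nn as nn

class MultiLabelTextCNN(nn.Module):
    """textCNN modified for multi-label tagging with non-text features:
    deep text branch + wide non-text branch, multi-channel sigmoid output."""

    def __init__(self, vocab_size, num_labels, non_text_dim,
                 emb_dim=128, num_filters=64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes
        )
        # Wide & Deep style: pooled text representation concatenated with raw
        # non-text features (typecode / source category / brand encodings).
        self.fc = nn.Linear(num_filters * len(kernel_sizes) + non_text_dim,
                            num_labels)

    def forward(self, token_ids, non_text):
        x = self.emb(token_ids).transpose(1, 2)          # (B, emb, seq)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        deep = torch.cat(pooled, dim=1)                  # deep text representation
        return self.fc(torch.cat([deep, non_text], dim=1))  # one logit per label

# Multi-label targets are merged into one binary vector per sample, and the
# loss is per-channel binary cross-entropy instead of softmax cross-entropy.
model = MultiLabelTextCNN(vocab_size=5000, num_labels=30, non_text_dim=16)
tokens = torch.randint(1, 5000, (8, 20))      # batch of 8 name sequences
side = torch.rand(8, 16)                      # batch of non-text features
target = torch.randint(0, 2, (8, 30)).float() # merged multi-label vectors
loss = nn.BCEWithLogitsLoss()(model(tokens, side), target)
```

BCEWithLogitsLoss applies a sigmoid and binary cross-entropy to each output channel independently, which is the usual replacement for softmax cross-entropy in this multi-label setting; per-label weights such as those discussed next can be supplied via its weight/pos_weight arguments.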

On the other hand, all samples are trained in the same model, so the original sample imbalance has negative effects. For example, major categories such as home building materials and clothing, shoes and bags have hundreds of thousands of samples each, while labels such as labor protection supplies or beauty and hairdressing supplies have only thousands. During training it is easy for small sample sets, which barely affect overall accuracy, to be effectively discarded. We therefore compute a weight for each label’s sample set:

where C_k is the sample count of category k and h is a smoothing hyperparameter; in effect, the smaller a category’s sample count, the larger its weight. The category weights are then added to the loss function:
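The original formula images are not preserved here; one plausible instantiation consistent with the description (a weight that shrinks as the sample count C_k grows, smoothed by h, combined with per-channel binary cross-entropy and the two regularization terms defined next) is:

$$
w_k = \frac{1}{\log\!\left(h + C_k\right)}, \qquad
L = -\sum_{k} w_k \Big[ y_k \log \hat{y}_k + (1 - y_k)\log\big(1 - \hat{y}_k\big) \Big]
\;+\; \lambda_1 \lVert \Theta_{\text{text}} \rVert^2
\;+\; \lambda_2 \lVert \Theta_{\text{non-text}} \rVert^2 ,
$$

where $y_k$ is the ground truth for label $k$ and $\hat{y}_k$ is the sigmoid output of channel $k$.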

where λ1 and λ2 are the regularization strength factors for the text features and the non-text features respectively.

After transformation and optimization:

Although the sample counts of the labels above differ greatly, classification maintains a relatively high recall throughout. In addition, we developed a model interpretation tool for the modified model; for cases that do not meet expectations, the tool directly shows:

  • whether the decisive factor in the classification was a text or a non-text feature, together with the specific feature and its value;

  • for text features, the positions of the highest-weighted convolution windows, and the weight of each term within each window.

This tool gives deep-model predictions LR-like interpretability, which makes locating problems and iterating easier; a sketch of the idea follows.
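The sketch below shows how window-level interpretation can be read off the modified textCNN above: the argmax positions that survive max-pooling indicate which convolution windows drove the prediction. The function is hypothetical and assumes the MultiLabelTextCNN sketch from earlier.

```python
import torch

def top_windows(model, token_ids):
    """For each conv branch, report which input windows the filters selected
    during max-pooling -- a crude saliency signal for text features."""
    x = model.emb(token_ids).transpose(1, 2)      # (B, emb, seq)
    report = []
    for conv in model.convs:
        act = conv(x).relu()                       # (B, filters, windows)
        values, positions = act.max(dim=2)         # per-filter chosen window
        # The window most filters agree on tends to carry the most weight.
        consensus = torch.mode(positions, dim=1).values
        report.append({
            "kernel_size": conv.kernel_size[0],
            "consensus_window_start": consensus.tolist(),
            "strongest_activation": values.max(dim=1).values.tolist(),
        })
    return report
```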

Overall, after the business modifications and iterations of textCNN described above, accuracy on random data rose by more than 5% while business rules were reduced by 66%, with clear gains on long-tail cases (low-frequency texts and brands).

Related work

The previous sections introduced some of our work on category tag construction.

On the other hand, for category tags to take effect in map search, we must identify which queries should be recalled via tags. Early on, we manually annotated the query-to-tag correspondence for head queries in Amap search, but manual labeling always has limited coverage:

  • The label-recall benefit of mid- and low-frequency queries cannot be captured, so label utilization is not high enough;

  • For queries that should use label recall, falling back to text recall because the relationship was not recognized also hurts the online effect, degrading user experience and producing bad cases.

Therefore, after completing the system construction, we built another model, based mainly on text semantics and click features, to identify the recall relationship between queries and labels, i.e., which queries can be recalled via category labels (a sketch follows). Through this series of work, we mined the label recall of mid- and low-frequency queries in depth; tag recall now accounts for 94% of category-search traffic on Amap.
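Below is a minimal sketch of such a query-to-label recall model, combining character n-gram text features with a click-derived statistic; the feature, data, and model choices here are assumptions for illustration, not the production design.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

queries = ["underwear store", "victoria's secret", "123 main street"]
# Hypothetical click feature: share of this query's clicks landing on POIs
# that already carry the candidate tag.
tag_click_share = [[0.85], [0.78], [0.02]]
use_tag_recall = [1, 1, 0]   # 1 = this query should be recalled via the tag

text = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)).fit_transform(queries)
X = hstack([text, csr_matrix(tag_click_share)])
clf = LogisticRegression(max_iter=1000).fit(X, use_tag_recall)
```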

Gains

After the label data is produced, two aspects need to be evaluated:

  • The quality of the produced data, i.e., the precision and coverage of the labels themselves:

High-quality data not only supports the search business but also helps promote and implement our category label system across the whole BU.

  • The improvement in the search effect of actual online queries after launch:

Besides data quality, the search effect is evaluated on the new tag data, i.e., the new tag system is compared against the old one. In manual GSB (Good/Same/Bad) evaluation, the launched data brought a very obvious improvement in category-search quality for every category, making search more precise and more complete in assisting users’ decisions.

Summary

The current work focuses on using general features to solve the main category tagging problem. Problems that general features cannot solve are usually handled by building external knowledge and resources, such as the brand library and the collection of A-level scenic spot resources.

Beyond general features, however, the use of non-general features to improve classification on some of the data is still insufficient. A series of special optimizations should follow, such as:

  • Review and introduction texts, combined with some attention mechanism, may provide good complementary signals;

  • POI images often contain category-related information that image recognition could exploit;

  • External encyclopedia and knowledge-graph data could assist in building the mid- and low-frequency brand library.

The closed business loop should also keep being built out, e.g., a pipeline for circulating bad cases with an automatic repair mechanism, and the discovery of new brands and new labels, to prevent the tagging system’s effect from degrading over long-term operation.

Whether with machine learning or external resource-assisted methods, the features of massive long-tail data are often weak: many online POIs lack features and carry only a bare name, from which their category can hardly be predicted accurately. How to guide users to submit category information themselves, or to complete category labeling through crowdsourcing, is also a direction we need to focus on in the future.