1. Background
The ability to answer questions accurately and directly is essential to building an assistant that "knows much and understands you". In the voice assistant scenario, polysemous words and colloquial expressions occur frequently. Take "Li Bai" as an example: "an equipment build for Li Bai" refers to the game character, "Li Bai's poems" refers to the Tang-dynasty poet, and "play Li Bai" refers to the song. Accurately identifying which entity the user means and giving the correct answer is the challenge the assistant faces.

The knowledge graph is the cornerstone of machine understanding of the objective world, offering strong expressive power and modeling flexibility. OPPO's self-built knowledge graph, OGraph, has accumulated hundreds of millions of entities and billions of relationships. This article describes how Xiaobu Assistant and OGraph come together: resolving entity ambiguity with entity linking technology, and helping build an intelligent assistant that can hear, speak, and understand you better.
2. Task Introduction
Entity linking is a basic task in NLP and knowledge graph research: given a piece of Chinese text, associate each entity mention in it with the corresponding entity in a given knowledge base.
The entity linking task was first proposed at the TAC conference in 2009. Before deep learning became popular in 2014, entity linking relied on statistical features and graph-based methods. In 2017, the Deep Joint scheme used an Attention structure for semantic matching to disambiguate entities; follow-up work performed entity recognition and disambiguation simultaneously in a single model through structural innovation, with disambiguation still based on Attention. In 2018, the Deep Type scheme recast disambiguation as entity classification: after predicting the entity category, the target entity is determined through link counts. In 2020, as pre-trained language models became popular, the Entity Knowledge scheme used large corpora and a powerful pre-trained model to perform entity linking as sequence labeling.
3. Technical Solutions
Entity linking is typically broken down into three subtasks: entity recognition, candidate entity recall, and entity disambiguation.
3.1 Entity Recognition
Entity recognition identifies entity mentions in the query. For example, in the query "who was the emperor of Li Bai's dynasty", "Li Bai" is the mention to be identified. In open-domain entity recognition, entities are numerous, of many types, and expressed in varied ways, so the scheme must balance efficiency and generalization.
Word Parser is a self-developed dictionary-based matching tool for entity recognition, which outperforms open-source tools in both performance and functionality.
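Word Parser itself is not public; a minimal sketch of the underlying idea, dictionary-based longest-match scanning over a character trie, might look like this (names are illustrative):

```python
# Minimal sketch of dictionary-based mention matching.
class TrieMatcher:
    def __init__(self, entity_dict):
        # Build a character trie over all entity names/aliases.
        self.root = {}
        for name in entity_dict:
            node = self.root
            for ch in name:
                node = node.setdefault(ch, {})
            node["$"] = True  # end-of-word marker

    def find_mentions(self, query):
        """Scan the query, returning (start, end, surface) longest matches."""
        mentions = []
        i = 0
        while i < len(query):
            node, j, last = self.root, i, -1
            while j < len(query) and query[j] in node:
                node = node[query[j]]
                j += 1
                if "$" in node:
                    last = j  # remember the longest match so far
            if last > 0:
                mentions.append((i, last, query[i:last]))
                i = last  # greedy longest-match, non-overlapping
            else:
                i += 1
        return mentions

matcher = TrieMatcher({"李白", "李白的诗"})
print(matcher.find_mentions("李白的诗有哪些"))  # [(0, 4, '李白的诗')]
```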
Entity linking does not care about the entity type, so entity recognition can use either B/I labels or pointer annotation. Meanwhile, to enrich the input, lexical information is introduced as supplementary features; we tested Lattice LSTM and FLAT, which improved entity recognition by about 1%.
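For illustration, the two labeling schemes encode the same mention span as follows (a toy sketch, not the production code):

```python
# Illustrative encoding of one mention span under the two labeling schemes.
def bio_labels(query, start, end):
    labels = ["O"] * len(query)
    labels[start] = "B"
    for i in range(start + 1, end):
        labels[i] = "I"
    return labels

def pointer_labels(query, start, end):
    # Two binary sequences marking mention start and (inclusive) end.
    s = [1 if i == start else 0 for i in range(len(query))]
    e = [1 if i == end - 1 else 0 for i in range(len(query))]
    return s, e

query = "李白是哪个朝代的"
print(bio_labels(query, 0, 2))      # ['B', 'I', 'O', 'O', 'O', 'O', 'O', 'O']
print(pointer_labels(query, 0, 2))  # ([1, 0, ...], [0, 1, 0, ...])
```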
3.2 Candidate Entity Recall
The goal of candidate recall is to use the mention to recall all possible candidate entities. Online queries, however, are full of nicknames, homophones, and colloquialisms, including pet-name aliases for popular apps and characters, as well as speech-recognition errors such as "plump Ultraman" and "tractor Ultraman" for Ultraman characters. If such aliases are not mapped to the correct entity, recall fails. In voice scenarios there is no direct user feedback, and aliases cannot be mined from user clicks as in search, so entity alias mining is quite challenging.
Based on the characteristics of the problem, we built two mining schemes. General alias mining uses an information-extraction-centered process with pattern-based generation as a supplement, making full use of the description, alias, and relation information in OGraph.
For aliases arising from user input errors, homophones, and near-form characters, Xiaobu Assistant innovatively built an alias discovery process based on feature clustering (a minimal code sketch follows the steps below). The mining steps are as follows:
1) Query filtering: select queries that may contain aliases to be mined from user search queries and online queries, via domain keyword filtering and search click log filtering.
2) Entity recognition: use entity recognition to extract the aliases to be mined from the filtered queries. The recognition model is a general entity recognition model fine-tuned on the vertical domain.
3) Domain feature construction: aliases obtained directly from entity recognition are not accurate enough and are not yet associated with standard entity names, so domain features are constructed to make that association. Given the characteristics of the scenario, we chose radical features and pinyin features.
4) Feature clustering: cluster on these features to associate each mined alias with its standard entity name. With this scheme we mined hundreds of thousands of entity aliases at 95% accuracy, solving the problem of high-frequency aliases online.
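A minimal sketch of steps 3 and 4, keyed on pinyin only and using the third-party pypinyin package (an assumption; the real pipeline also uses radical features and a proper clustering algorithm):

```python
# Homophone-based alias grouping: associate mined aliases with standard
# names that share the same pinyin.
from collections import defaultdict
from pypinyin import lazy_pinyin

def pinyin_key(name):
    return " ".join(lazy_pinyin(name))

def cluster_by_pinyin(standard_names, mined_aliases):
    """Attach each mined alias to the standard names sharing its pinyin."""
    index = defaultdict(list)
    for name in standard_names:
        index[pinyin_key(name)].append(name)
    alias_map = {}
    for alias in mined_aliases:
        for name in index.get(pinyin_key(alias), []):
            alias_map.setdefault(name, []).append(alias)
    return alias_map
```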
3.3 Entity Disambiguation
Entity disambiguation is the most critical step in entity linking. Its essence is ranking: candidate entities are ranked, each receives a score, and the highest-scoring one is selected as the final result. Entity disambiguation faces three main problems:
1) In the voice assistant scenario, most texts are short, so context features are scarce and additional features must be constructed to help disambiguation.
2) Disambiguation cannot rely on semantic features alone; global disambiguation features must also be fully considered. For example, in the query "Andy Lau who lectures students on comparing a bomb to a girlfriend's hand", the professor Liu Dehua (same Chinese name) at Tsinghua University looks like the better semantic match, but the intended entity is actually the actor Andy Lau, the query alluding to a line from his bomb-disposal movie.
3) There are unaligned entities in the graph, which makes model disambiguation difficult and the training corpus prone to inconsistent annotation. For example, "China" and "the People's Republic of China" are two separate entities in some open-source graphs, so some training samples are labeled "China" and others "the People's Republic of China", even though both are correct.
To address these problems, we start from data preparation, model selection, and model optimization, solving each through targeted optimization.
3.3.1 Data Preparation
When constructing disambiguation samples, we provide the model with as much information as possible. Each sample consists of three parts: the query sample, the entity sample, and statistical features.
Query sample construction: when building the query sample, the position of the mention is passed to the model so that it can locate the mention precisely in the query. We therefore add the identifier "#" to both sides of the mention, for example: "who was the emperor of #Li Bai#'s dynasty".
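A one-line sketch of the marker insertion:

```python
def mark_mention(query, start, end, marker="#"):
    """Wrap the mention span with '#' so the model sees its position."""
    return query[:start] + marker + query[start:end] + marker + query[end:]

print(mark_mention("李白是哪个朝代的", 0, 2))  # -> "#李白#是哪个朝代的"
```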
Entity sample construction: the entity sample needs to contain the features of the entity to be disambiguated. In the first version, we concatenated the mention with the entity's standard name so the model could judge whether the two are the same, constructed a "type: <entity type>" description to provide type information, and appended the entity description and the triples from the graph. In the second version, we expressed the name-match feature directly as the text "same name" / "different name" and added an analogous "common term" / "not a common term" feature. Optimizing the feature expression improved model performance by about 2%.
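A sketch of how the second-version entity sample might be assembled as text (field wording and separators are assumptions, not the exact production format):

```python
# Build the entity-side input text from graph information.
def build_entity_sample(mention, entity):
    parts = []
    parts.append("same name" if mention == entity["name"] else "different name")
    parts.append("common term" if entity.get("is_common_term") else "not a common term")
    parts.append("type: " + entity["type"])
    parts.append(entity.get("description", ""))
    for s, p, o in entity.get("triples", []):
        parts.append(f"{s} {p} {o}")  # flatten graph triples into text
    return " [SEP] ".join(p for p in parts if p)
```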
Statistical feature construction: to keep the disambiguation model from focusing only on semantic features, statistical features such as query-entity co-occurrence, popularity, richness, and mention co-occurrence are computed and fed to the model to assist global disambiguation.
3.3.2 Model Selection
Learning-to-rank has three common settings: pointwise, pairwise, and listwise. Entity disambiguation only needs the top-1 result, so we do not need the full ordering among candidates, only their global relevance. We therefore choose the pointwise approach.
Summarizing previous disambiguation models: Deep Joint starts from ranking and Deep Type from classification, and both achieved good results, so both ranking and classification help disambiguation. Based on the characteristics of our task, we designed a multi-task framework that ranks and classifies at the same time: the two tasks share model parameters and are trained together, optimizing a joint loss function. By sharing information between the ranking and classification tasks, the model performs better. The multi-task loss function is shown below.
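A weighted combination consistent with this description (the weight λ and the concrete loss choices are our assumptions, not published values) would be:

L_total = λ · L_rank + (1 − λ) · L_cls

where L_rank is the ranking loss computed on the tanh score and L_cls is the cross-entropy loss over entity types.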
To integrate the statistical features better, we also append them to the feature vector used for ranking.
Finally, the model structure is as follows. The query sample and entity sample are concatenated and fed into the pre-trained language model, and the vector at the [CLS] position is concatenated with the statistical features to form the feature vector. The ranking task feeds the feature vector into a fully connected layer and outputs a score in the [-1, 1] interval through tanh; the higher the score, the more likely the candidate is the target entity. The classification task feeds the feature vector into a fully connected layer and outputs per-class scores through a softmax layer.
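A minimal PyTorch sketch of this architecture, assuming a Hugging-Face-style encoder whose output exposes last_hidden_state (layer sizes and names are illustrative, not OPPO's implementation):

```python
import torch
import torch.nn as nn

class MultiTaskLinker(nn.Module):
    def __init__(self, encoder, stat_dim, num_types, hidden=768):
        super().__init__()
        self.encoder = encoder                      # pre-trained LM, e.g. a BERT
        self.rank_head = nn.Linear(hidden + stat_dim, 1)
        self.cls_head = nn.Linear(hidden + stat_dim, num_types)

    def forward(self, input_ids, attention_mask, stat_feats):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]       # vector at the [CLS] position
        feats = torch.cat([cls_vec, stat_feats], dim=-1)
        rank_score = torch.tanh(self.rank_head(feats)).squeeze(-1)  # in [-1, 1]
        type_logits = self.cls_head(feats)          # softmax applied in the loss
        return rank_score, type_logits
```

During training, the ranking loss can be, for example, an MSE or margin loss on rank_score, and the classification loss a CrossEntropyLoss on type_logits (which applies softmax internally).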
Experiments show that the multi-task entity linking model outperforms the single-task model in overall F1.
3.3.3 Model Optimization
To understand which input features the model focuses on, we used the mutual-information visualization method to visualize the importance of each token: the darker the color, the higher the importance. The visualization shows that the model tends to focus on entity types and on the segments that distinguish candidates, such as "eat slowly", "food", "way of eating", and "ham sausage" in Example 1; "Sandy", "SpongeBob SquarePants", and the switching-power-supply brand in Example 2; and "character", "race", and "Dream Three Kingdoms" in Example 3. The features the multi-task model attends to are all helpful for disambiguation.
Confident Learning (CL) is an algorithm framework for identifying label errors and characterizing label noise. Following the confident-learning idea, we trained five models on the original data in a 5-fold manner, used them to predict the original training set, fused the five models' outputs into a "true" label, and removed from the training set the samples whose fused label disagreed with the original label. The cleaned samples accounted for 3% of the data, of which over 80% were genuine label errors.
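A minimal sketch of the 5-fold cleaning step, with scikit-learn's cross_val_predict standing in for the real multi-task model:

```python
# Flag samples whose out-of-fold prediction disagrees with the given label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def find_noisy_labels(X, y):
    # Out-of-fold predictions: each sample is predicted by a model
    # that never saw it during training (cv=5 => five models).
    oof_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
    suspect = np.where(oof_pred != y)[0]  # disagreement => candidate label error
    return suspect
```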
Adversarial training constructs adversarial samples and includes them in model training. During normal training, if the gradient direction is steep, a small perturbation can have a large effect. To resist such perturbations, adversarial training attacks the model with perturbed adversarial samples during training, improving its robustness. Both FGM and PGD, two ways of generating adversarial samples, brought significant improvements.
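A common FGM implementation for PyTorch models looks roughly like the following (the embedding-layer name and epsilon are assumptions):

```python
import torch

class FGM:
    """Fast Gradient Method: perturb embedding weights along the gradient."""
    def __init__(self, model, eps=1.0, emb_name="embedding"):
        self.model, self.eps, self.emb_name = model, eps, emb_name
        self.backup = {}

    def attack(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)
                if norm != 0:
                    p.data.add_(self.eps * p.grad / norm)

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# Usage inside the training loop (schematic):
#   loss.backward()                 # gradients on the clean sample
#   fgm.attack()                    # add adversarial perturbation to embeddings
#   loss_adv = compute_loss(model, batch)
#   loss_adv.backward()             # accumulate adversarial gradients
#   fgm.restore()                   # put the original embeddings back
#   optimizer.step(); optimizer.zero_grad()
```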
4. Technology Application
4.1 Application in Xiaobu Assistant
In the Xiaobu Assistant scenario, users have high expectations of the voice assistant's intelligence and ask all kinds of interesting questions, such as multi-hop questions and six-degree relation queries. With entity linking, Xiaobu Assistant can accurately identify what the user refers to, and combined with subgraph matching it can handle entity questions, structured questions, multi-hop questions, and six-degree relation queries, covering most structured questions.
To verify Xiaobu Assistant's entity linking ability, we tested our algorithm on both a self-built evaluation set and the QianYan (千言) evaluation set, achieving good results on both.
4.2 Applications in OGraph
Entity linking is applied not only in KBQA; it is also a crucial part of the information extraction pipeline. The OGraph information extraction pipeline uses the CasRel information extraction model and an MRC model specially trained for entity-type questions to obtain a large number of candidate triples.
After the triples are obtained, entity linking is used to link them to entities in the knowledge base. To ensure the recall of entity linking, we optimized candidate entity recall and additionally used the mined entity alias list together with an ES (Elasticsearch) retrieval system; the whole pipeline has produced millions of triples.
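A rough sketch of how alias-table lookup and ES retrieval could be combined for candidate recall (index and field names are assumptions, and the body-style search call targets the classic elasticsearch-py client):

```python
# Combine the mined alias table with a fuzzy Elasticsearch lookup.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
alias_table = {}  # mined alias -> list of standard names, from the mining pipeline

def recall_candidates(mention, size=10):
    candidates = set(alias_table.get(mention, []))  # exact alias hits first
    resp = es.search(index="ograph_entities",
                     body={"query": {"match": {"alias": mention}}, "size": size})
    for hit in resp["hits"]["hits"]:
        candidates.add(hit["_source"]["name"])      # fuzzy ES hits as backup
    return candidates
```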
5. Conclusion
Through its exploration of entity linking technology, Xiaobu Assistant can now better interpret colloquial expressions, aliases, and ambiguous entity references. With the help of the other KBQA components and OGraph's rich knowledge reserves, it can already answer most user questions. The road to natural language understanding is long, and the evolution of Xiaobu Assistant will never stop.
6. References
1. Deep Joint Entity Disambiguation with Local Neural Attention. Octavian-Eugen Ganea, Thomas Hofmann.
2. Improving Entity Linking by Modeling Latent Entity Type Information. Shuang Chen, Jinpeng Wang, Feng Jiang, Chin-Yew Lin.
3. End-to-End Neural Entity Linking. Nikolaos Kolitsas, Octavian-Eugen Ganea, Thomas Hofmann.
4. Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking. Samuel Broscheit.
5. Towards Deep Learning Models Resistant to Adversarial Attacks. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras.
6. Confident Learning: Estimating Uncertainty in Dataset Labels. Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang.
7. Towards a Deep and Unified Understanding of Deep Neural Models in NLP. Chaoyu Guan, Xiting Wang, Quanshi Zhang, Runjin Chen, Di He, Xing Xie.
About the Author
FrankFan, Senior NLP Algorithm Engineer at OPPO
He mainly works on dialogue and knowledge graph projects, with rich R&D experience in entity mining, sequence labeling, relation extraction, and entity linking, and holds more than ten patents.