Background
Entity linking refers to associating entity mentions in a given text (such as search queries, microblogs, conversations, articles, or video and image titles) with the corresponding entities in a given knowledge base. There are two common task designs: pipeline and end-to-end.
Task abstraction
The problem
The Qianyan dataset released by Baidu PaddlePaddle defines an entity linking task for Chinese short texts: given a Chinese short text, the mentions it contains and their positions, the model must predict the ID of the entity in the given knowledge base that each mention refers to. If no corresponding entity exists in the knowledge base, i.e., the mention is NIL, the entity category must be predicted instead.
The training set contains 70k examples, with an average query length of 22 characters and 260k mentions in total. Each mention has 6.3 candidate entities on average; 30k mentions link to NIL, of which 16k have same-name entities in the knowledge base. Three characteristics stand out:
- Text length is short and context information is limited
- There are many candidate entities
- NIL mentions are common, accounting for more than 10% of the data
Model solution
Since mention spans are given in this competition, we only need to address two tasks: entity disambiguation and NIL classification. The key questions are: how to construct input samples, how to design the model structure, how to rank NIL entities together with other entities, and how to mine richer, multi-dimensional features.
Sample construction
We selected pre-trained language models such as ERNIE and RoBERTa for semantic feature extraction, and concatenated the text to be linked with the entity description using the [SEP] token as the model input.
Query sample construction: the query input must convey the mention's position so the model can locate it in the query. For example, in "SpongeBob: SpongeBob and Patrick are working hard and they are on the highway!", "SpongeBob" appears twice, once linking to the cartoon SpongeBob SquarePants and once to the cartoon character SpongeBob, and the two must be distinguished. To solve this, we inject position information by introducing identifier tokens, adding the marker "#" on both sides of the mention, as shown in the following sample:
Entity description sample construction: a knowledge base entity contains the standard entity name (subject), the entity type, and SPO (subject-predicate-object) information about the entity. When constructing the sample, the mention and the entity's standard name are concatenated with "-" as input, reinforcing whether the standard name and the mention are identical. Entity type is important information for disambiguation, so we build the description "Type: <entity type>" and place it right after the standard name to prevent it from being truncated. For SPO information we keep only the attribute values, which reduces the number of samples exceeding the maximum input length by 35%.
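Putting the pieces together, the input construction described above can be sketched as follows. This is a minimal illustration: the "#" marker, the "-" connector, [SEP] concatenation and the field order follow the description, while the function and parameter names are our own.

```python
def build_query_sample(query, mention, start):
    """Surround the mention with '#' markers to encode its position."""
    end = start + len(mention)
    return query[:start] + "#" + mention + "#" + query[end:]

def build_entity_sample(mention, subject, ent_type, attr_values):
    """mention-subject, then the type description, then SPO attribute values only."""
    spo = " ".join(attr_values)
    return f"{mention}-{subject}, Type: {ent_type}. {spo}"

def build_model_input(query_sample, entity_sample):
    """Concatenate the query sample and entity description with [SEP]."""
    return query_sample + "[SEP]" + entity_sample

q = build_query_sample("I like Li Bai", "Li Bai", 7)
e = build_entity_sample("Li Bai", "Li Bai (Tang poet)", "Person", ["poet", "Tang dynasty"])
sample = build_model_input(q, e)
```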
Statistical feature sample construction: data and features determine the upper bound of the model. To enrich the model input, features such as entity type, entity length, mention length, and the Jaccard similarity between entity and mention are appended to the feature vector output by the model.
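As an illustration of these hand-crafted features (the article lists only the feature categories; the exact encoding and names below are our own assumptions), character-level Jaccard similarity and the length features could be computed like this:

```python
def jaccard(mention, entity_name):
    """Character-level Jaccard similarity between mention and entity name."""
    a, b = set(mention), set(entity_name)
    return len(a & b) / len(a | b) if a | b else 0.0

def statistical_features(mention, entity_name, ent_type, type_vocab):
    """Statistical features appended to the model's output vector."""
    return [
        float(type_vocab.index(ent_type)),  # entity type id
        float(len(entity_name)),            # entity length
        float(len(mention)),                # mention length
        jaccard(mention, entity_name),      # string overlap
    ]

feats = statistical_features("Li Bai", "Li Bai (Tang poet)", "Person", ["Person", "Work"])
```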
Model structure
Entity disambiguation is essentially ranking candidate entities: the model takes the query and entity information as input, scores each candidate, and selects the top-1 entity. Learning to rank has three common paradigms: pointwise, pairwise and listwise. Since entity disambiguation only needs the top-1 result, we do not need the full ordering among candidates, only a globally comparable relevance score. We therefore chose the pointwise approach.
Schematic of the learning-to-rank model
The entity classification task and the entity linking task do not appear directly related, but Chen et al. [2] showed that disambiguation becomes easier when the mention's type can be predicted. We therefore designed a multi-task framework that performs ranking and classification at the same time: the two tasks share model parameters, are trained jointly, and their loss functions are optimized together. By sharing information between the ranking and classification tasks, the model achieves better performance.
The final structure of our model is as follows: the query and entity description are concatenated and fed into the pre-trained language model, and the [CLS] vector is concatenated with the vectors at the mention's start and end positions to form the feature vector. The ranking head feeds the feature vector into a fully connected layer and outputs a score in [-1, 1] through tanh; the higher the score, the more likely the candidate is the target entity. The classification head feeds the feature vector into a fully connected layer and outputs per-class scores through a softmax layer.
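A bare-bones sketch of the two heads on top of the shared feature vector (numpy only, random weights; the real model uses the pre-trained encoder and learned parameters, and the dimensions here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multitask_heads(feature_vec, w_rank, w_cls):
    """Shared feature vector -> pointwise ranking score in [-1, 1] and class probs."""
    rank_score = np.tanh(feature_vec @ w_rank)  # ranking head: tanh score
    cls_probs = softmax(feature_vec @ w_cls)    # classification head: softmax over types
    return rank_score, cls_probs

rng = np.random.default_rng(0)
feat = rng.normal(size=8)  # stands in for [CLS] + mention start/end vectors
score, probs = multitask_heads(feat, rng.normal(size=8), rng.normal(size=(8, 24)))
```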
Entity linking and classification model structure
Model optimization
Data cleaning
Data cleaning based on confident learning: analyzing the dataset, we found some labeling errors. Following the idea of confident learning from Northcutt et al. [6], we trained 5 models with N-fold cross-validation on the original data, used them to predict labels for the original training set, and fused the labels output by the 5 models as the "true" labels. Samples whose fused labels disagree with the original labels are then removed from the training set; based on experience, no more than 10% of samples should be removed.
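A simplified sketch of this cleaning step, where a majority vote over the five models' predictions stands in for the full confident-learning machinery (function names and the fusion rule are our own assumptions):

```python
from collections import Counter

def fuse_labels(model_predictions):
    """Fuse per-model label lists into one list by majority vote."""
    return [Counter(preds).most_common(1)[0][0]
            for preds in zip(*model_predictions)]

def clean_dataset(samples, original_labels, fused_labels, max_ratio=0.10):
    """Drop samples whose fused label disagrees with the original, capped at max_ratio."""
    disagree = [i for i, (o, f) in enumerate(zip(original_labels, fused_labels)) if o != f]
    budget = int(len(samples) * max_ratio)       # never remove more than max_ratio
    to_drop = set(disagree[:budget])
    return [s for i, s in enumerate(samples) if i not in to_drop]

fused = fuse_labels([[1, 0, 1], [1, 1, 1], [1, 0, 0]])
kept = clean_dataset(["a", "b", "c"], [1, 1, 1], fused, max_ratio=0.5)
```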
NIL entity ranking experiments
During entity disambiguation, we designed three schemes to answer how NIL entities should be handled: ranked together with other entities, treated as a separate classification task, or converted into a special entity that participates in ranking.
- Scheme 1: only rank entities that exist in the knowledge base; when the top-1 score falls below a threshold, the mention is judged to be NIL.
- Scheme 2: construct a NIL entity sample "mention-mention, Type: unknown", for example "Three Heroes-Three Heroes, Type: unknown", indicating that the entity is unknown. During both training and prediction, this unknown entity is added to every mention's candidate set to participate in ranking.
- Scheme 3: concatenate all candidate entity descriptions, feed them together with the query sample into the model, and classify whether the mention is NIL. In theory this brings in more global information. For training speed, we first rank with Scheme 1 and then concatenate the entity descriptions of the top-3 candidates to train the classification model.
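Scheme 1 reduces to a simple thresholding rule on the top-1 ranking score; a minimal sketch (the threshold value and names are illustrative, not from the article):

```python
def link_or_nil(candidate_scores, threshold=0.0):
    """Return the index of the best candidate, or None (NIL) if below threshold."""
    if not candidate_scores:
        return None  # no candidates at all -> NIL
    best = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
    return best if candidate_scores[best] >= threshold else None
```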
Adversarial training
Schematic of the adversarial training process
Adversarial training is the method of constructing adversarial examples and including them in model training. During normal training, if the gradient direction is steep, a small perturbation can have a large effect. To resist such perturbations, adversarial training attacks the model with perturbed adversarial examples during training, improving its robustness. We experimented with FGM and PGD for generating adversarial examples.
Code for generating adversarial examples
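The original code listing is not reproduced here. As a stand-in, FGM's embedding perturbation r_adv = ε · g / ‖g‖ can be sketched in numpy (ε, the framework-free formulation, and the names are our own; real implementations apply this to the embedding table inside the training loop and restore it afterwards):

```python
import numpy as np

def fgm_perturbation(grad, epsilon=1.0):
    """FGM: perturb embeddings along the gradient direction, scaled to norm epsilon."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        return np.zeros_like(grad)  # no gradient signal, no perturbation
    return epsilon * grad / norm

r = fgm_perturbation(np.array([3.0, 4.0]), epsilon=1.0)
```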
Analysis of experimental results
Model interpretability
Schematic of the interpretability method
After training the model, we first want to know what features it has learned. Guan et al. [7] proposed a visualization method based on mutual information. Compared with other visualization methods, this one is universal and consistent: the metric being interpreted has a clear meaning, and differences across neurons, layers and models can be compared directly.
Code for the interpretability method
To understand which input features the model focuses on, we reimplemented the algorithm on Paddle 2.0 and visualized the importance of each word: the darker the color, the higher the importance. The visualization shows that the model tends to focus on entity types and on the segments that distinguish entities, such as "slow eating", "food", "eating method" and "ham sausage" in Example 1, "Sandy", "SpongeBob SquarePants" and "switch power brands" in Example 2, and segments such as "character", "race" and "Dream of the Three Kingdoms" in Example 3. The features the multi-task model attends to are all helpful for disambiguation.
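The paper's method is mutual-information based and beyond a short snippet; as a much simpler illustration of per-word importance (occlusion, not the paper's method), one can measure how much the model score drops when each word is masked, given any scoring function:

```python
def word_importance(words, score_fn, mask="[MASK]"):
    """Importance of word i = score drop when word i is replaced by a mask token."""
    base = score_fn(words)
    importances = []
    for i in range(len(words)):
        occluded = words[:i] + [mask] + words[i + 1:]
        importances.append(base - score_fn(occluded))
    return importances

# Toy scorer: counts occurrences of "cat"; a real score_fn would call the model.
toy_score = lambda ws: float(ws.count("cat"))
imps = word_importance(["a", "cat", "b"], toy_score)
```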
Analysis of experimental results
Parameter settings in the experiments are as follows: for the ERNIE and BERT models, batch size is 64, initial learning rate 5e-5, and max_seq_length 256; for RoBERTa-Large, batch size is 32, initial learning rate 1e-5, and max_seq_length 256. An exponential learning rate decay strategy is adopted.
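The exponential decay schedule mentioned above multiplies the learning rate by a fixed factor at each decay step; a one-line sketch (the decay rate 0.9 is illustrative, not from the article):

```python
def exponential_decay(initial_lr, decay_rate, step):
    """lr(step) = initial_lr * decay_rate ** step."""
    return initial_lr * decay_rate ** step

lr0 = exponential_decay(5e-5, 0.9, 0)
lr2 = exponential_decay(5e-5, 0.9, 2)
```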
Comparing different pre-trained models and confident learning, the model effect ranks RoBERTa-Large > ERNIE + confident learning > ERNIE > BERT. ERNIE's optimization specifically for Chinese data indeed gives it an edge over BERT. However, the comparison between ERNIE (12 layers) and RoBERTa-Large (24 layers) shows that bigger is stronger: more parameters yield better performance.
Table 1 Comparison of pre-training models
We used an ERNIE-based model to compare single-task and multi-task training. The multi-task model is not only simpler but also outperforms the combination of single-task models. When ranking, the model needs type information to judge whether the mention matches a candidate entity, and NIL classification can learn from the other candidate entities in the knowledge base; sharing parameters between the two tasks thus lets the model extract what they have in common and improves performance.
Table 2 Single-task vs. multi-task comparison
Models 1, 2 and 3 show that adversarial learning is a general model optimization method, with significant improvement across all models. However, there is no obvious difference between FGM and PGD, the strongest first-order adversarial method, on the entity linking task.
Table 3 Effect of adversarial learning
In the experiments on how NIL participates in ranking, we found little difference between constructing NIL samples to participate in matching and thresholding the top-1 score, with AUC of 0.97 and 0.96 respectively. The trained NIL classifier reached only 0.94 AUC, presumably because sampling the top-3 candidates accumulates errors.
ROC curves for the different ways NIL participates in ranking
By ensembling the best-performing models, we achieved 88.7 on the dev set, 88.63 on the A-leaderboard and 91.20 on the B-leaderboard, finally ranking second.
Exploration in Xiaobu Assistant
Xiaobu Assistant handles tens of millions of user queries per day, and up to 30% of them involve entity words. These real conversations from different users contain not only a wealth of subjective expressions but also many coined words, polysemous words and synonyms, and often pose ambiguous questions such as "Who is Li Bai?" and "I want to listen to Li Bai".
The entity linking pipeline in Xiaobu Assistant
The technical accumulation behind Xiaobu Assistant not only helped us place among the best in the competition, but also helps resolve queries that voice assistants easily misunderstand, such as "brother's representative works", "Who is Li Bai?" and "I want to listen to Li Bai". Hegel said that thought makes people stand upright: thought gives humanity its dignity and drives civilization forward. Behind Xiaobu Assistant are the thoughts of countless contributors working quietly, striving to surprise users with an assistant that is interesting, caring and soulful.
Reflections
As the saying goes, a craftsman must sharpen his tools before he can do good work. This competition used the PaddlePaddle 2.0 framework for training; in dynamic graph mode the program executes and outputs results immediately, which improves coding efficiency. With Baidu's PaddleNLP toolkit you can seamlessly switch between pre-trained models such as ERNIE, BERT and RoBERTa, which is perfect for quick experimentation during competitions.
PaddleNLP toolkit link: github.com/PaddlePaddl…
The topic of this competition is also worth exploring: there are many ways to organically combine the two tasks of entity disambiguation and classification. An entity linking task can be abstracted in a variety of ways; as the saying goes, troops have no constant disposition and water no constant shape. When solving algorithmic problems, we should think outside the box, abstract the problem from different angles, and find the best solution.
Links to this project:
In addition to the entity linking task, the Qianyan project also hosts ongoing leaderboards for sentiment analysis, reading comprehension, open-domain dialogue, text similarity, semantic parsing, simultaneous machine translation, information extraction and other areas.
Portal: www.luge.ai/
References
[1] Octavian-Eugen Ganea, Thomas Hofmann. Deep Joint Entity Disambiguation with Local Neural Attention.
[2] Shuang Chen, Jinpeng Wang, Feng Jiang, Chin-Yew Lin. Improving Entity Linking by Modeling Latent Entity Type Information.
[3] Nikolaos Kolitsas, Octavian-Eugen Ganea, Thomas Hofmann. End-to-End Neural Entity Linking.
[4] Shuang Chen, Jinpeng Wang, Feng Jiang, Chin-Yew Lin. Improving Entity Linking by Modeling Latent Entity Type Information.
[5] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras. Towards Deep Learning Models Resistant to Adversarial Attacks.
[6] Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang. Confident Learning: Estimating Uncertainty in Dataset Labels.
[7] Chaoyu Guan, Xiting Wang, Quanshi Zhang, Runjin Chen, Di He, Xing Xie. Towards a Deep and Unified Understanding of Deep Neural Models in NLP.
About the author
Fan by the source
NLP engineer, OPPO Xiaobu Assistant
Responsible for dialogue and knowledge graph work, including intent classification, sequence labeling, relation extraction, entity linking, etc.