Article source | Hengyuan Cloud community
Original post | Improving Unsupervised QA via Summarization-Informed Question Generation
The original author | Mathor
Shanghai announced yesterday that it will launch a new round of sectioned, grid-based nucleic acid screening across the city. **[Hengyuan Cloud]**'s mascot Yundundun ☁️ reminds everyone, whether staying home or going out, to stay mindful of the epidemic~
Perhaps because many of you are staying at home, or perhaps because spring has lifted everyone's enthusiasm, technical posts have been arriving in the community one after another!
Today, I bring you another article from our old friend Mathor.
The main text begins below.
Abstract
Question generation (QG) is the task of generating a plausible question for a given <passage, answer> pair. Existing approaches have two main drawbacks:
- Heuristic methods: the generated questions remain closely tied to their declarative counterparts.
- Supervised methods: they are tightly bound to the domain/language of the QA dataset used as training data.
This paper proposes an unsupervised QG approach that uses questions heuristically generated from summaries as the training data for a QG system (heuristics turn declarative summary sentences into appropriate questions).
- The heuristics used in this paper include dependency parsing, named entity recognition, and semantic role labeling.
The questions produced by this unsupervised QG method are then combined with the original articles to train a neural QG model end-to-end.
1 Introduction
The purpose of question generation is to generate meaningful questions given an input passage and a corresponding answer.
Early studies on QG were template-based, but such questions lack diversity and have a high degree of lexical overlap with their corresponding declarative sentences. For example, from *Stephen Hawking announced the party in the morning*, a template yields *Who announced the party in the morning?*. As you can see, there is high lexical overlap between the generated question and the statement. This is undesirable for QA systems, because the strong lexical cues in the question let a model answer it without any real understanding of the meaning.
Later, neural Seq2Seq models became dominant in QG, trained on <passage, answer, question> triples usually obtained from human-created QA datasets. This approach is limited to the domain and language of those datasets, and creating them costs a great deal of time and money.
This article presents a new unsupervised approach that frames QG as a summarize-then-ask process. Using freely available summary data, dependency parsing, named entity recognition, and semantic role labeling are performed on the summaries, and heuristics are then applied to the parsed summaries to generate questions.
Figure 1 shows an example (sample questions generated by the semantic role labeling heuristic from a summary sentence, using different candidate answer spans):
Questions are generated from the summary rather than the original passage, so the summary serves as a bridge between question and passage, leaving less lexical overlap between the two. This is feasible because the summary contains the most important information of the passage and is semantically similar to it. In addition, summary data is much easier to obtain than QA data, since many QA datasets are created specifically to train QA systems.
2 Related Work
In unsupervised QA, QA models are trained on synthetic data produced by a QG model rather than on existing QA datasets. Prior work that likewise avoids existing QA datasets includes *Unsupervised Question Answering by Cloze Translation* and *Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering*. These works propose template/rule-based question generation methods and treat retrieved or cited passages as the source passages, in order to alleviate lexical similarity between passages and questions.
3 Methodology
The proposed approach first applies heuristics to summary data to create synthetic QG data, and then trains the QG model on it.
Figure 2 shows the model of this paper (answers and questions are generated from the summary by the question generation heuristics; the answer is concatenated with the article to form the encoder input, and the question serves as the ground-truth output of the decoder):
3.1 QUESTION GENERATION
In order to avoid generating trivial questions that are highly similar to the corresponding declarative statements, summary data is used as a bridge between the generated questions and the original article.
- Dependency parsing (DP) is performed on each summary sentence, followed by named entity recognition (NER) and semantic role labeling (SRL).
- DP identifies the main verb (the root of the dependency tree) and related components such as auxiliary verbs.
- NER identifies all entities in the summary sentence, in order to choose the most appropriate question to generate.
- The core component is SRL, which captures all semantic frames of a summary sentence. Each frame consists of a verb and a set of arguments, each corresponding to a phrase in the sentence.
- For example, arguments can include the Agent (the initiator of the action described by the verb), the Patient (the undergoer of the action), and a set of modifier arguments such as ARG-TMP (temporal) or ARG-LOC (locative).
- Questions are generated from arguments based on the argument type and NER tag, which jointly determine the WH-word.
Figure 1 shows the SRL analysis of the example summary sentence, with the verb and each of its arguments bracketed. From the three arguments, the three questions shown in Figure 1 can be generated; a rough sketch of this heuristic is given below.
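To make the heuristic concrete, here is a minimal sketch in Python. It assumes an AllenNLP-style SRL frame (a verb plus labeled arguments); the WH-word table and the simplified wh-movement/verb-decomposition rules are illustrative assumptions, not the paper's exact rules.

```python
# A minimal sketch of the SRL question heuristic; WH-word choice, wh-movement,
# and verb decomposition are heavily simplified assumptions.
WH_BY_ROLE = {"ARG0": "Who", "ARG1": "What", "ARGM-TMP": "When", "ARGM-LOC": "Where"}

def questions_from_frame(verb_base: str, verb_past: str, args: dict) -> list:
    """Generate one question per argument by replacing that argument with a WH-word."""
    questions = []
    for role in args:
        wh = WH_BY_ROLE.get(role)
        if wh is None:
            continue
        others = {r: t for r, t in args.items() if r != role}
        if role == "ARG0":
            # Subject question: the WH-word stands in for the agent directly.
            questions.append(f"{wh} {verb_past} {' '.join(others.values())}?")
        elif "ARG0" in others:
            # Non-subject question: wh-movement plus verb decomposition
            # ("announced" -> "did ... announce").
            rest = [t for r, t in others.items() if r != "ARG0"]
            questions.append(" ".join([wh, "did", others["ARG0"], verb_base] + rest) + "?")
    return questions

# Example (cf. Figure 1):
# questions_from_frame("announce", "announced",
#     {"ARG0": "Stephen Hawking", "ARG1": "the party", "ARGM-TMP": "in the morning"})
# -> ["Who announced the party in the morning?",
#     "What did Stephen Hawking announce in the morning?",
#     "When did Stephen Hawking announce the party?"]
```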
3.2 TRAINING A QUESTION GENERATION MODEL
The summary data used in this paper consists of <article, summary> pairs. Applying the question generation heuristics to each summary yields <article, question, answer> triples, which serve as the QG training data.
Rather than deploying a pipeline that first generates a summary and then generates questions from it, this article trains a single end-to-end Seq2Seq model, eliminating the risk of error accumulation across generation stages. By training the neural generation model on these QG data, the model is expected to learn summarization and question generation jointly; in other words, this knowledge is implicitly injected into the model through the QG data.
To train the question generation model, each passage and answer are concatenated into a single sequence:

passage [SEP] answer

where [SEP] is a special symbol used to separate the passage from the answer. This sequence is the input, and the target output is the question. BART is used as the generator, optimized with the following negative log-likelihood loss:

$$\mathcal{L} = -\sum_{i=1}^{|q|} \log P\left(q_i \mid q_{<i}, C, A\right)$$

where $q_i$ is the $i$-th token of the question $q$, and $C$ and $A$ denote the context (passage) and the answer, respectively.
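As a concrete illustration, below is a minimal sketch of this training step using Hugging Face's `transformers` library. It is not the authors' released code; the choice of BART checkpoint and the reuse of BART's `</s>` token as the passage/answer separator are assumptions.

```python
# A minimal sketch of one BART-QG training step (not the paper's released code).
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

passage = "Stephen Hawking announced the party in the morning."
answer = "Stephen Hawking"
question = "Who announced the party in the morning?"

# Encoder input: passage [SEP] answer; here BART's </s> plays the separator role
# (an assumption, since the exact separator token is not specified in this summary).
inputs = tokenizer(passage + " </s> " + answer,
                   return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(question, return_tensors="pt").input_ids

# The model's built-in cross-entropy loss is exactly the negative log-likelihood
# -sum_i log P(q_i | q_<i, C, A) given above.
loss = model(**inputs, labels=labels).loss
loss.backward()  # an optimizer step would follow in a full training loop
```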
4 Experiments
4.1 EXPERIMENT SETUP
4.1.1 Question Generation
**Datasets**: The proposed method is evaluated on XSUM, a news summarization dataset collected from the BBC News website. XSUM consists of 226,711 news articles, each paired with a one-sentence summary.
**QG Details**: AllenNLP is used to obtain the dependency trees, named entities, and semantic role labels of the summary sentences.
Triples that meet any of the following three conditions are removed (see the sketch after this list):
- Articles with more than 480 tokens (exceeding BART's maximum input length);
- Answers of which fewer than 55% of the tokens appear in the article (to ensure adequate lexical overlap between the answer and the article);
- Questions with fewer than 5 tokens (very short questions may have lost too much information).
A total of 14,830 <article, question, answer> triples remain as QG training data.
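For concreteness, the three filters above can be sketched as follows; whitespace tokenization stands in for the paper's actual tokenizer, which this summary does not specify.

```python
# A minimal sketch of the triple-filtering step described above.
def keep_triple(article: str, answer: str, question: str) -> bool:
    article_tokens = article.split()
    answer_tokens = answer.split()
    if len(article_tokens) > 480:                      # exceeds BART's input length
        return False
    overlap = sum(t in article_tokens for t in answer_tokens)
    if overlap < 0.55 * max(len(answer_tokens), 1):    # too little answer/article overlap
        return False
    if len(question.split()) < 5:                      # very short questions lose information
        return False
    return True
```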
4.1.2 Unsupervised QA
**Datasets**: Experiments are run on six extractive QA datasets: SQuAD 1.1, NewsQA, Natural Questions, TriviaQA, BioASQ, and DuoRC.
The official SQuAD 1.1, NewsQA, and TriviaQA data are used; for Natural Questions, BioASQ, and DuoRC, the pre-processed data released by MRQA is used.
A Wikipedia dump is used to create the synthetic QA training data. All HTML tags and reference links are first removed, and paragraphs longer than 500 characters are extracted; 60K paragraphs are then sampled from all paragraphs in the dump. SpaCy and AllenNLP's NER toolkits are used to extract the entity mentions in each paragraph as candidate answers.
Then, <paragraph, answer> pairs that meet any of the following three criteria are removed (see the sketch after this list):
- Paragraphs with fewer than 20 words or more than 480 words;
- Pairs for which no answer was extracted, or for which the extracted answer does not appear in the paragraph due to tokenization artifacts;
- Answers consisting of a single pronoun.
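As promised above, a minimal sketch of this filtering step; the pronoun list and whitespace word-counting are illustrative assumptions.

```python
# A minimal sketch of the <paragraph, answer> filtering described above.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "i", "we", "you"}

def keep_pair(paragraph: str, answer: str) -> bool:
    n_words = len(paragraph.split())
    if n_words < 20 or n_words > 480:           # paragraph too short or too long
        return False
    if not answer or answer not in paragraph:   # no answer, or answer lost by tokenization
        return False
    if answer.lower() in PRONOUNS:              # single-pronoun answers are uninformative
        return False
    return True
```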
Each remaining paragraph and answer are concatenated into the same passage [SEP] answer format and fed into the trained BART-QG model to obtain the corresponding question. This results in 20K synthetic QA pairs, which are then used to train the unsupervised QA model.
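A minimal sketch of this generation step, reusing the input format from Section 3.2; the decoding hyperparameters (beam size, maximum length) are assumptions.

```python
# A minimal sketch of synthetic QA-pair creation with the trained BART-QG model.
def generate_question(model, tokenizer, passage: str, answer: str) -> str:
    inputs = tokenizer(passage + " </s> " + answer,
                       return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(inputs.input_ids, num_beams=4, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Each surviving (paragraph, answer) pair yields one synthetic
# (paragraph, question, answer) instance for training the QA model.
```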
4.2 RESULTS
A BERT QA model is trained on the 20K generated synthetic QA pairs, and its performance is first validated on three Wikipedia-based validation sets: SQuAD 1.1, Natural Questions, and TriviaQA. The results of this method are shown in Table 1 and Table 2.
Unsupervised baselines:
- A QG model trained via unsupervised neural machine translation, generating 4M synthetic QA instances to train the QA model;
- Dependency-tree-based question generation, using retrieved documents as the passages.
4.3 EFFECT OF DIFFERENT HEURISTICS
- **Naive-QG** uses only the summary sentence as the context (not the original passage) and simply replaces the answer span with the appropriate question word. For example, *Stephen Hawking announced the party in the morning* becomes *Stephen Hawking announced what in the morning?*. The summary sentences are used as inputs and the questions as target outputs to form the QG training data.
- The original article of each summary is used as the passage instead of the summary sentence itself, to avoid high lexical overlap between the passage and the question.
- **Main Verb**: questions are generated only from the SRL frame of the main verb in the summary sentence's dependency tree, rather than from verbs in subordinate clauses;
- **WH-Movement**: the question word is moved to the beginning of the sentence;
- **Decomp-Verb**: the verb is decomposed into its base form plus an auxiliary (e.g. *announced* becomes *did ... announce*);
- **NER-WH**: NER tags are used to produce more precise question words. For example, for *NBA player Michael Jordan*, the question word becomes *which NBA player* rather than *who* or *what*. A small sketch of this idea follows the list.
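Here is the promised sketch of the NER-WH idea; the tag-to-phrase mapping and the descriptor rule are illustrative assumptions rather than the paper's exact table.

```python
# A minimal sketch of NER-WH: choose a finer-grained WH-phrase from the
# answer's NER type, or from a category noun phrase inside the answer span.
NER_TO_WH = {"PERSON": "who", "ORG": "which organization", "DATE": "when",
             "GPE": "where", "CARDINAL": "how many"}

def wh_phrase(ner_tag: str, descriptor: str = "") -> str:
    """E.g. for answer 'NBA player Michael Jordan' (PERSON) with descriptor
    'NBA player', return 'which NBA player' instead of a generic 'who'."""
    if descriptor:                         # the span carries a category noun phrase
        return f"which {descriptor}"
    return NER_TO_WH.get(ner_tag, "what")  # fall back to a generic WH-word
```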
5 Takeaways
- Could questions be generated from other kinds of descriptions?
- Could the heuristics encode other kinds of attribute knowledge?