Knowledge is power

— Bacon

Background

Last year saw rapid development of language models: BERT, XLNet, ALBERT and others kept topping the NLP leaderboards. Many of the standout leaderboard tasks are reading-comprehension tasks such as SQuAD. In SQuAD, the model reads a given passage and then answers a few questions; if an answer exists, it must be found within that passage. So although it is called reading comprehension, it is really closer to sequence tagging: the model marks the answer span within a given passage. The problem this paper addresses is open-domain QA: for a question q, the model must find the answer in a knowledge base containing a large number of documents, rather than in a single given passage as in SQuAD.

Most language models are pre-trained with a task called Masked Language Modeling (MLM), which gives the model something like the ability to fill in cloze blanks. By training on large corpora, pre-trained language models such as BERT actually acquire some implicit knowledge. For example, given "The ___ is the currency of the United Kingdom," BERT would probably fill in the word "pound." Although it still learns and reasons from word co-occurrence statistics, it appears to possess something like knowledge. Since last year, research has increasingly moved from pure language models toward language models with knowledge embedded in them, such as the two namesake ERNIE models proposed by Tsinghua University and Baidu.
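If you want to see this cloze behaviour for yourself, the `transformers` fill-mask pipeline reproduces it; the model name and the exact top prediction may vary, so treat this as an illustrative check rather than part of the paper:

```python
from transformers import pipeline

# Load a standard pre-trained BERT together with its MLM head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Ask BERT to fill in the blank from the example above.
for pred in unmasker("The [MASK] is the currency of the United Kingdom."):
    print(f"{pred['token_str']:>10}  (score: {pred['score']:.3f})")
```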

But the implicit knowledge described above is hard to pin down and hard to expand. This paper proposes a more modular and interpretable way to embed knowledge. In short, the approach is to train a separate "knowledge retriever" that decides which piece of knowledge should be consulted during inference. Moreover, the retriever is pre-trained jointly with the language model in an unsupervised way, which greatly improves model performance.

Methods

As shown in the figure above, the paper involves two tasks: the MLM language-model pre-training task on the left and the QA task on the right. Below is a more complete flow chart of the pre-training task, which is where we start.

The whole process is divided into two key steps. The first step is the neural knowledge retriever, which is responsible for computing p(z|x). To do this, both x and z first need to be encoded. The question x is fed into BERT directly and the output at the [CLS] token is taken as its encoding vector; for a document z, the title and body are joined with [SEP] before being fed into BERT, and the output at the [CLS] token is likewise taken. BERT's output vectors are then projected down to a lower dimension.

For each document z in the knowledge base,

$$p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')}$$

where f measures the relevance between the question and the document,

$$f(x, z) = \mathrm{Embed}_{\text{input}}(x)^{\top} \, \mathrm{Embed}_{\text{doc}}(z)$$
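A minimal sketch of this retrieval step, assuming a hypothetical `bert_cls` encoder and projection matrices `W_input`/`W_doc` (the names are illustrative, not taken from the released code):

```python
import numpy as np

def embed_input(x_tokens, bert_cls, W_input):
    """Encode the question: take BERT's [CLS] output, then project to a lower dimension."""
    return W_input @ bert_cls(x_tokens)                    # shape: (d,)

def embed_doc(title_tokens, body_tokens, bert_cls, W_doc):
    """Encode a document: title [SEP] body through BERT, take [CLS], project."""
    return W_doc @ bert_cls(title_tokens + ["[SEP]"] + body_tokens)

def retrieval_distribution(x_vec, doc_vecs):
    """p(z|x): softmax over relevance scores f(x, z) = x_vec . z_vec."""
    scores = doc_vecs @ x_vec                              # f(x, z) for every z
    scores = scores - scores.max()                         # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()
```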

This part is called the neural knowledge retriever, and it produces a probability p(z|x) for every z. Now we can do the second step: combining x and z to predict y. The figure above is a pre-training example, where y is the masked-out word. To use z, its body is concatenated with x to provide context, and the following objective is optimized:

$$p(y \mid z, x) = \prod_{j} p(y_j \mid z, x)$$

where y_j is the j-th masked token.
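A sketch of this masked-token objective, with a hypothetical `mlm_token_probs` standing in for BERT's MLM head:

```python
import numpy as np

def log_p_y_given_zx(x_tokens, z_tokens, masked_positions, true_token_ids, mlm_token_probs):
    """log p(y | z, x) = sum_j log p(y_j | z, x).

    x_tokens: the masked sentence; z_tokens: body of the retrieved document.
    mlm_token_probs(tokens, position) -> vocabulary distribution at that position.
    """
    joined = ["[CLS]"] + x_tokens + ["[SEP]"] + z_tokens + ["[SEP]"]
    offset = 1  # positions of x_tokens shift by one because of the leading [CLS]
    logp = 0.0
    for pos, tok_id in zip(masked_positions, true_token_ids):
        vocab_dist = mlm_token_probs(joined, pos + offset)
        logp += np.log(vocab_dist[tok_id])
    return logp
```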

Things are a little different in the QA setting. Since y is now predicted with respect to a specific retrieved document z, the authors reduce the open-domain QA task to a SQuAD-style reading-comprehension task: finding the answer span within that document.

This part is known as the knowledge-augmented encoder.
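As a rough sketch of the fine-tuning reader, one can enumerate candidate spans of the retrieved document and keep the best-scoring one; `span_score` here is a stand-in for the reader's span head, not the paper's exact parameterization:

```python
import numpy as np

def best_answer_span(joined_tokens, span_score, max_span_len=10):
    """Enumerate candidate spans and return the highest-scoring one.

    span_score(tokens, start, end) -> scalar score for the span [start, end].
    """
    best, best_score = None, -np.inf
    n = len(joined_tokens)
    for start in range(n):
        for end in range(start, min(start + max_span_len, n)):
            s = span_score(joined_tokens, start, end)
            if s > best_score:
                best, best_score = (start, end), s
    return best, best_score
```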


Training

The tasks for the pre-training phase and the QA fine-tuning phase have been described above. Training maximizes log p(y|x) of the correct y, where p(y|x) = Σ_z p(y|z, x) p(z|x) marginalizes over the retrieved document, and both tasks can be optimized end to end as described above.

But there is a problem here: that formula sums over every document z in the entire knowledge base, which is prohibitively expensive. The authors approximate the sum using only the k documents with the highest probability, since most documents are unrelated to the question and have p(z|x) close to zero. That still leaves the problem of how to find the k documents with the highest probability.
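A minimal sketch of the top-k approximation, assuming the retrieval embeddings and the per-document log p(y|z, x) values have already been computed (variable names are mine):

```python
import numpy as np

def log_marginal_likelihood(x_vec, topk_doc_vecs, topk_log_p_y_given_zx):
    """log p(y|x) approximated over the k best documents:
    log sum_{z in top-k} p(z|x) * p(y|z, x)."""
    scores = topk_doc_vecs @ x_vec                        # f(x, z) for the top-k documents
    log_p_z = scores - np.logaddexp.reduce(scores)        # renormalized log p(z|x) over top-k
    return np.logaddexp.reduce(log_p_z + topk_log_p_y_given_zx)
```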

From the formula, p(z|x) is a monotonic function of the inner product of the two encodings: the denominator is the same for every z, so ranking by the numerator ranks the documents. Maximum Inner Product Search (MIPS) algorithms can therefore be used to find the top-k documents. However, building a fast index requires the encoding vectors to be fixed, and this condition cannot be satisfied because the encoders are continually being trained. As a compromise, the authors re-encode the documents and rebuild the index every few hundred training steps. This refreshing happens only during language-model pre-training; during QA fine-tuning, the encoder produced by pre-training is used just once to encode all z and x and build the index.
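A brute-force stand-in for the MIPS index (production systems would use an approximate-search library; plain NumPy is enough to show the idea):

```python
import numpy as np

class SimpleMIPSIndex:
    """Brute-force maximum inner product search over a fixed snapshot of document embeddings."""

    def __init__(self, doc_vecs):
        self.doc_vecs = doc_vecs                 # (num_docs, d), frozen until the next refresh

    def topk(self, x_vec, k=5):
        scores = self.doc_vecs @ x_vec
        idx = np.argpartition(-scores, k)[:k]    # indices of the k largest inner products
        return idx[np.argsort(-scores[idx])]     # sorted by score, best first

# During pre-training this index is rebuilt every few hundred steps from freshly
# re-encoded documents; during QA fine-tuning it is built once and reused.
```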


Additional strategies

In the course of their experiments, the authors found several strategies that make the model easier to train:

  • Mask only tokens that genuinely require knowledge to predict (usually entities and dates) in the MLM task (a toy sketch follows this list)
  • Add a dummy null document to the top-k retrieved documents
  • Avoid retrieving the document that x itself comes from (x is masked, so if z contains x the answer is exposed!)
  • To avoid the vicious cycle of a cold-started retriever being too weak to retrieve anything useful, they initialize the retriever with a model trained on the Inverse Cloze Task (ICT)
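A toy illustration of the first strategy, masking only date-like spans; real salient-span masking would also rely on a named-entity tagger, so treat this as a sketch:

```python
import re

# Crude date detector: four-digit years stand in for "spans that require knowledge".
DATE_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def salient_span_mask(sentence, mask_token="[MASK]"):
    """Mask a salient (date-like) span; return (masked_sentence, answer) or None if nothing to mask."""
    match = DATE_RE.search(sentence)
    if match is None:
        return None                      # nothing knowledge-bearing found; skip this sentence
    masked = sentence[:match.start()] + mask_token + sentence[match.end():]
    return masked, match.group()

# Example:
# salient_span_mask("Einstein was born in 1879 in Ulm.")
# -> ("Einstein was born in [MASK] in Ulm.", "1879")
```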

Results comparison

The paper's main competitors are combinations of a sparse retriever with a neural reading-comprehension model, such as the well-known DrQA. Sparse retrievers are models that retrieve using features such as TF-IDF. There are also neural retriever + neural reader combinations similar to this paper; ORQA, mentioned there, is very close to this work, with this paper adding the pre-training step on top of it. Finally, there are generative models, such as a fine-tuned T5 (terrifying!).

The results on NaturalQuestions-Open (NQ), WebQuestions (WQ) and CuratedTrec (CT) are as follows:

In a word: impressive. This model retrieves only the top 5 documents, while other models may take 20-80 documents and still cannot beat it. Note the two items in the parentheses after "Ours": Z is the knowledge base, which is easy to understand, and X is the corpus used for pre-training. Also, pre-training is a critical step that contributes greatly to performance.


Afterword.

I find both this paper and the ORQA work it cites very impressive: knowledge embedding has evolved from last year's entity-level granularity to today's sentence and passage level. Imagine today's word-based sparse search engines evolving into dense, NN-friendly search engines, where everything is mapped into a vector space and all kinds of neural networks roam freely in that space.

Links to papers: kentonl.com/pub/gltpc.2…
