
Source | Zhihu

Link | zhuanlan.zhihu.com/p/163343976

Author | Sanhe factory younger sister


This post walks through Dynamic Fusion Network for Multi-Domain End-to-End Task-Oriented Dialog (ACL 2020). Before that, it reviews, in chronological order, several classic papers on memory networks and knowledge-base-grounded task-oriented dialogue.

【Knowledge base-oriented】

Knowledge bases can be integrated into many tasks. For example, ERNIE builds knowledge into a BERT-style model, and in the winning solution of an information-extraction competition I saw that adding a knowledge base was also a key ingredient: when extracting a person or a work, the knowledge base helps confirm which type the entity is. Knowledge bases also matter in dialogue generation: without an external knowledge base, the semantics of generated responses can only come from the training data, and common-sense knowledge is out of reach. For example:

"I want a Starbucks." "There is no Starbucks nearby. Do you need another coffee shop?"Copy the code

If the training data never says that Starbucks is a coffee shop, the model cannot capture that semantics. Relying on a knowledge base is the equivalent of continuing the education of a ten-year-old, which is much easier than starting from teaching a baby to crawl.

【Memory networks】

Memory networks (MemNN) are a branch of NLP in their own right. Their defining trait is that, unlike ordinary encoders such as LSTMs and CNNs, which compress information into a hidden state and extract features from it (the resulting memory is too small, and much useful information is lost during compression), MemNN stores all of the information in an external memory that is trained jointly with inference. This yields a long-term memory module that can be read and updated, preserving useful information as fully as possible.

Let's work through a few papers to see the specifics:

1. MemNN: principle introduction

MemNN consists of two main operations:

  • Embedding: the input is embedded with two matrices, an input matrix and an output matrix.
  • Inference: reasoning is performed with the internal vectors above.
  1. Input memory representation: compute the dot product between the question embedding (Embedding B) and the input matrix (Embedding A), then normalize it to get a probability vector P matching the number of memory slots, i.e. the degree of relevance between the question and each memory vector;
  2. Output memory representation: take the weighted sum of the output matrix (Embedding C) according to the probability vector P to get the output vector O, which amounts to selecting the combination of memory vectors most relevant to the question;
  3. Output calculation: convert the output vector into the required answer format and obtain a probability for each candidate word; this is a fully connected matrix multiplication followed by a softmax.
  • Multi-layer inference relates statements and questions through several stacked hops (the stacked layers on the right of the paper's figure); a minimal sketch of one hop follows below.
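To make the two operations concrete, here is a minimal single-hop end-to-end memory network (MemN2N) layer in PyTorch. It is a sketch under simplifying assumptions, not the authors' original code: sentences are encoded as bags of embedded words, and the module name and shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemN2NHop(nn.Module):
    """One hop of an end-to-end memory network (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.A = nn.Embedding(vocab_size, embed_dim)  # input memory embedding
        self.C = nn.Embedding(vocab_size, embed_dim)  # output memory embedding

    def forward(self, story, u):
        # story: (batch, n_sentences, sent_len) token ids; u: (batch, d) query
        m = self.A(story).sum(dim=2)   # input memories, bag-of-words: (batch, n, d)
        c = self.C(story).sum(dim=2)   # output memories: (batch, n, d)
        # step 1: relevance of the query to each memory vector
        p = F.softmax(torch.bmm(m, u.unsqueeze(2)).squeeze(2), dim=1)
        # step 2: weighted sum of the output memories
        o = torch.bmm(p.unsqueeze(1), c).squeeze(1)
        return u + o                   # updated query, fed to the next hop
```

Step 3 of the list above then amounts to a softmax over a final linear layer applied to the updated query; stacking several such hops gives the multi-layer inference.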

Problems with MemNN

  • MemNN stores the input content without substantial compression (all parameters are embeddings), so information integrity is high, which gives it an advantage over RNNs and other compressing models in question answering and reasoning. The downside is that storage grows linearly with the amount of content, and the demand on memory bandwidth grows with it.
  • Because MemNN's computation selects the most relevant vectors among the many vectors produced from the sentences to generate the answer, the intermediate result matrices are very sparse: only the strongly correlated parts carry value, and the rest are almost zero. Dense-compute accelerators (such as GPUs) therefore perform poorly here, and software and hardware should focus on exploiting this sparsity.

2. KV-MemNN: a typical key-value memory network

In MemN2N, the embedding linearly transforms the context as a whole into memory; the difference in KV-MemNN is that it introduces external knowledge sources and turns each memory into a (key, value) pair.

It consists of the following three steps:

  • Key Hashing: using an inverted index, select a candidate set of N key-value pairs as potential memories from the knowledge base, keeping only pairs whose key shares a word with the query (stop words excluded);
  • Key Addressing: this stage uses the hashing results (the candidate memories) to compute a relevance probability against the linearly transformed query, similar to the inner product in MemNN:

$$p_{h_i} = \mathrm{Softmax}\left(A\Phi_X(x) \cdot A\Phi_K(k_{h_i})\right)$$

where $x$ is the query, $\Phi$ is a feature map (feature filter), and $A$ is a learned matrix; at the beginning the query vector is $q = A\Phi_X(x)$.

  • Value Reading: the key is designed to be relevant to the query (the blue matrix in the original figure) and the value to be relevant to the answer (the yellow part), so a weighted sum over the values is taken using the probabilities obtained from addressing:

$$o = \sum_i p_{h_i}\, A\Phi_V(v_{h_i})$$

which means reading valuable memories from the knowledge source with query-biased attention.

  • Memory is accessed in a multi-hop loop; although unsupervised, each hop intuitively corresponds to one step of reasoning. The multi-hop update of the query is:

$$q_{j+1} = R_j\left(q_j + o\right)$$

where $R_j$ is a learned matrix for hop $j$. Finally, the model output is dot-multiplied with the candidate answer embeddings, and a softmax cross-entropy loss is computed against the label:

$$\hat{a} = \operatorname*{argmax}_{i}\ \mathrm{Softmax}\left(q_{H+1}^{\top}\, B\,\Phi_Y(y_i)\right)$$

 

The authors also tried many other key-value representations; if you are interested, read the original paper.
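As a concrete reading of the three steps, here is a small NumPy sketch of key addressing, value reading, and the multi-hop query update, using the notation above. Feature maps are assumed to be precomputed bag-of-features vectors, and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kv_memnn_read(phi_x, phi_keys, phi_vals, A, R_hops):
    """phi_x: (f,) query features Phi_X(x); phi_keys/phi_vals: (N, f)
    features of the hashed candidate key-value pairs; A: (d, f) embedding
    matrix; R_hops: list of (d, d) per-hop update matrices."""
    q = A @ phi_x                 # q = A * Phi_X(x)
    K = phi_keys @ A.T            # embedded keys   (N, d)
    V = phi_vals @ A.T            # embedded values (N, d)
    for R in R_hops:
        p = softmax(K @ q)        # key addressing: p_{h_i}
        o = p @ V                 # value reading: weighted sum of values
        q = R @ (q + o)           # multi-hop update of the query
    return q                      # scored against candidate answers at the end
```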

3. Mem2Seq: text generation with memory networks

Mem2Seq is a generation model that combines a multi-hop attention mechanism with the pointer-network idea. It effectively incorporates KB information, and it learns to generate dynamic queries to control memory access. It differs from KV-MemNN in the representation of the knowledge source and in being applied to sequence generation: every decoder step queries the memory and updates the query.

Encoder: part (a) of the paper's figure is the core of the encoder, a K-hop memory. The query is updated over the k hops, which resembles digging through the dialogue history to find out what the real query is. As the schemes above show, this is a standard MemNN operation.

Decoder: part (b) of the paper's figure describes the decoder. At each time step it produces two distributions: a vocabulary distribution $P_{\text{vocab}}$ and a memory distribution $P_{\text{ptr}}$; the memory distribution points into the dialogue history and the KB information.

The calculation is as follows:

$$P_{\text{vocab}}(\hat{y}_t) = \mathrm{Softmax}\left(W_1\,[h_t; o^1]\right), \qquad P_{\text{ptr}} = p_t^K$$

where $h_t$ is the decoder hidden state, $o^1$ the first-hop memory output, and $p_t^K$ the last-hop attention. When the target word appears in the memory, the word is copied from memory, realizing the copy function; when the target word is not in memory, the pointer is trained to point to a special sentinel token, and the model falls back to the vocabulary distribution $P_{\text{vocab}}$ to generate the output.
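The copy-versus-generate rule at inference time can be sketched as follows. This is PyTorch-flavoured pseudocode of the rule described above; the function and argument names are illustrative, and the sentinel is assumed to sit at a known memory position:

```python
import torch
import torch.nn.functional as F

def mem2seq_step(h_t, o_1, p_ptr, memory_tokens, sentinel_id, W1):
    """h_t: (d,) decoder state; o_1: (d,) first-hop memory output;
    p_ptr: (n+l+1,) last-hop attention over memory (incl. sentinel);
    memory_tokens: token id stored at each memory slot;
    W1: nn.Linear(2d, vocab) projection."""
    # P_vocab = Softmax(W1 [h_t ; o^1])
    p_vocab = F.softmax(W1(torch.cat([h_t, o_1])), dim=-1)
    i = int(p_ptr.argmax())              # position the pointer selects
    if memory_tokens[i] == sentinel_id:  # pointer hits the sentinel:
        return int(p_vocab.argmax())     #   fall back to generation
    return memory_tokens[i]              # otherwise copy from memory
```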

4. GLMP: global-to-local memory pointer networks for task-oriented dialogue

GLMP is the closest to the paper this post is building toward, so it deserves a careful introduction:

  1. The problem: how to effectively embed a knowledge base in a task-oriented dialogue system.
  • Mem2Seq's improvement was to turn the decoder into a pointer network, combining copying and generation with a memory network, effectively embedding the KB into task-oriented dialogue;
  • Dynamically embedding a large knowledge base undoubtedly injects a lot of noise into the model and increases computational overhead (knowledge bases are hard to encode and decode). To embed knowledge bases effectively in task-oriented dialogue, this paper proposes global-to-local memory pointer networks (GLMP).

2. GLMP structure

【Encoder】

The global memory encoder encodes the dialogue history and outputs two quantities: the global context representation and the global memory pointer.

  1. Global context representation
  • The context is encoded with a context RNN (a bidirectional GRU) over the user's utterances, producing a hidden state at each time step;
  • Each hop has a trainable embedding matrix, similar to Mem2Seq;
  • To overcome a drawback of memory networks, namely that correlations between memories are hard to model, the hidden states obtained are added into the dialogue-memory representation.

The encoder-side input is the external knowledge memory $M = [B; X] = (m_1, \ldots, m_{n+l})$, where $B$ is the KB information in the triplet form mentioned above, and $X$ is the dialogue history, also converted into triplet form so that both share a unified representation; $n + l$ is the combined number of dialogue words and external triplets.

Finally, the global context representation is obtained by using the last hidden state of the bidirectional GRU as the initial query and iterating it through K hops over the external knowledge.

2. Global Memory Pointer

The global memory pointer is used to filter noise out of the knowledge base before it reaches the decoder.

  • First, the last hidden state of the encoder queries the external knowledge through to the last hop; at the final hop only, the inner-product similarity is passed through a Sigmoid (giving values in 0-1), and the resulting memory distribution is the global memory pointer $G = (g_1, \ldots, g_{n+l})$, which is passed on to the decoder.
  • Training the global memory pointer requires an additional auxiliary task: the label $g_i^{\text{label}}$ checks whether the object word of memory $m_i$ appears in the corresponding gold response; it is 1 if present and 0 if not.

The final (binary) cross-entropy loss is:

$$\mathcal{L}_{g} = -\sum_{i=1}^{n+l}\left[g_i^{\text{label}}\log g_i + \left(1 - g_i^{\text{label}}\right)\log\left(1 - g_i\right)\right]$$

Since the Sigmoid output is trained as a binary decision, essentially true or false for each memory, this auxiliary task teaches the global memory pointer to filter the knowledge base, keeping only useful knowledge to pass to the decoder for instantiating slots.
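A minimal sketch of the global memory pointer and its auxiliary loss, following the description above (PyTorch; variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def global_memory_pointer(q_K, c_K, g_label=None):
    """q_K: (d,) last-hop encoder query; c_K: (n+l, d) last-hop memory
    embeddings; g_label: (n+l,) 1 if the memory's object word appears
    in the gold response, else 0."""
    g = torch.sigmoid(c_K @ q_K)     # inner-product similarity -> Sigmoid, in (0, 1)
    loss = None
    if g_label is not None:          # auxiliary task during training
        loss = F.binary_cross_entropy(g, g_label.float())
    return g, loss                   # G is passed on to the decoder
```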

【Decoder】

  • The decoder uses a sketch RNN that first generates placeholders, then uses the external knowledge filtered by the global memory pointer to look up the specific information for each slot, and finally uses the local memory pointer to instantiate the unfilled slot values. A slot value may come from the knowledge base or be generated content.
    • First, a rough sketch response is generated with unfilled slot values but with sketch tags. The sketch RNN is a single-layer GRU whose output vocabulary includes the sketch tags: for example, it produces "@poi is @distance away" instead of "Starbucks is 1 mile away".
    • At each time step, the hidden state of the sketch RNN plays two roles:
  1. If the step generates rather than copies, the decoder hidden state $h_t^d$ (the $d$-dimensional hidden state at time step $t$) predicts the next word:

$$P_{\text{vocab}}(\hat{y}_t) = \mathrm{Softmax}\left(E\, h_t^d\right)$$

where $E$ is the output embedding matrix. The loss is the standard cross-entropy over the sketch response:

$$\mathcal{L}_{v} = \sum_{t} -\log P_{\text{vocab}}(y_t)$$

2. The hidden state also serves as the query vector into the external knowledge: when the generated token is a sketch tag, the query, together with the previously encoded global memory pointer, is applied to the external knowledge to decide what to fill into the tag, with the global pointer filtering the external knowledge.

  • Acting as the query vector, it runs a pointer network over the filtered external knowledge, and the resulting distribution is the local memory pointer $L$.
  • The calculation follows the copy-pointer principle (at pure generate steps there is nothing to copy): the global pointer $G$ first scales the memory contents, and the attention of the query $h_t^d$ over the filtered memory gives the local pointer distribution $L_t$, whose argmax position is the entity copied into the slot.
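Putting the decoder together, here is a simplified sketch of filling one sketch tag with the local memory pointer (PyTorch). It compresses the paper's multi-hop memory query into a single attention step, so treat it as an illustration rather than the exact GLMP computation; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def fill_sketch_tag(h_t_d, c_K, g, memory_words):
    """h_t_d: (d,) sketch-RNN hidden state at a tag step; c_K: (n+l, d)
    last-hop memory embeddings; g: (n+l,) global memory pointer;
    memory_words: surface word stored at each memory position."""
    scores = (c_K @ h_t_d) * g              # global pointer filters noisy memories
    L_t = F.softmax(scores, dim=-1)         # local memory pointer distribution
    return memory_words[int(L_t.argmax())]  # entity copied into the slot
```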

GLMP in a few sentences:

  1. It introduces the combination of a knowledge base with a memory network;
  2. The decoder introduces a pointer network;
  3. The encoder encodes the context (including knowledge-base entities) with a memory network and generates a global memory pointer (for filtering noise), while the decoder partly borrows from the pointer network, using its hidden state to generate local pointers that determine which entity to copy.

DF-Net: dynamic fusion network for multi-domain end-to-end task-oriented dialogue

Dynamic Fusion Network for Multi-Domain End-to-End Task-Oriented Dialog

Compared with GLMP, the main problem this paper tackles is how to transfer learning quickly across domains in task-oriented dialogue. Task-oriented dialogue is domain-dependent, and data and models differ greatly across domains. The authors design a GLMP-based architecture, DF-Net, which automatically learns both the relatedness between domains and the knowledge specific to each domain.

  • Enhanced encoder and decoder modules
  • On top of GLMP, the encoder and decoder fuse hidden states from the shared (cross-domain) representation and the domain-specific representation: one part of each hidden state refers to the shared features and the other to the domain-specific features.

 

The specific enhancements:

  • Dynamic fusion
    • The authors argue that even though the enhanced encoder and decoder integrate information from different domains, they ignore the fine-grained correlations between domains. Hence the dynamic fusion architecture:
    • The data of each domain is first passed through GLMP to obtain the domain-specific features;
    • All the private features are then fused by a dynamic domain-specific feature fusion module. The authors borrow the Mixture-of-Experts (MoE) mechanism, predicting a probability distribution over the private domains; essentially it behaves like a small attention over the domains, with an auxiliary task supervising the per-domain weights (see the sketch after this list).

   

  • The fusion of shared features means replacing the original encoder and decoder with the enhanced versions.
  • Adversarial learning
    • Finally, the authors make some adjustments to train the model better, introducing adversarial learning so that the shared, cross-domain features are better characterized;
    • A gradient reversal layer is introduced for the domain classifier;
    • The final loss adds the MoE and adversarial terms to the GLMP losses: as we know from GLMP, the basic loss is already a mixture of several losses, and DF-Net mixes the additional tasks in on top of it. The experimental results are very strong: with a small amount of data, the model's transfer ability is more than ten points higher than the previous best.
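To illustrate the two additions, here is a compact sketch of MoE-style dynamic fusion plus a gradient reversal layer (PyTorch). This is a minimal reconstruction under stated assumptions, not the official DF-Net code; all class and argument names are illustrative, and the gate logits are reused for the auxiliary domain-prediction loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass so
    the shared encoder learns to fool the domain classifier (adversarial)."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

class DynamicFusion(nn.Module):
    def __init__(self, dim, num_domains):
        super().__init__()
        self.gate = nn.Linear(dim, num_domains)        # MoE expert gate
        self.domain_clf = nn.Linear(dim, num_domains)  # adversarial classifier

    def forward(self, shared, private, domain=None):
        # shared: (dim,) cross-domain feature; private: (num_domains, dim)
        # per-domain (expert) features; domain: 0-dim long tensor label.
        gate_logits = self.gate(shared)
        alpha = F.softmax(gate_logits, dim=-1)         # "little attention" over domains
        fused = alpha @ private                        # mixture of domain experts
        aux_loss = None
        if domain is not None:                         # auxiliary + adversarial losses
            moe_loss = F.cross_entropy(gate_logits.unsqueeze(0), domain.unsqueeze(0))
            adv_logits = self.domain_clf(GradReverse.apply(shared))
            aux_loss = moe_loss + F.cross_entropy(adv_logits.unsqueeze(0), domain.unsqueeze(0))
        return shared + fused, aux_loss
```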
