Introduction: In recent years, popular research on task-oriented dialogue has focused mainly on end-to-end frameworks, which differ greatly from the basic, traditional modular pipeline of Spoken Language Understanding (SLU), Dialogue Management (DM), and Natural Language Generation (NLG). Some of these end-to-end models are, in essence, sequence-to-sequence architectures combined with a knowledge base; for example, Manning’s two 2017 papers are also encoder-decoder models. However, such models place high demands on data annotation (see the annotation of the Stanford Dialog Dataset) and are still at the exploratory stage. Traditional algorithm frameworks remain the practical mainstay in industry. This article therefore focuses on the language understanding module within the traditional framework, and in particular on the joint model of intent and slot.

Outline of this article

First, we review the key points of task-oriented dialogue, including concepts and examples (Ideepwise, AliMe, etc.). Next, we move from task-oriented semantic representation to the overall dialogue framework, again with examples. Finally, as the focus of this article, we introduce the joint model of intent and slot in the language understanding module of the traditional algorithm framework.

1. What is task-oriented dialogue?

The concept of task-oriented dialogue

Goal

A task-oriented conversation is one that provides information or services under specific conditions, usually to satisfy a user who has a specific goal.

Specific scenarios and functions

For example, checking traffic, checking phone bills, ordering meals, booking tickets, and consultation are all typical task-oriented scenarios. Because user needs are complex, they usually have to be stated over several turns, and users may keep revising and refining their needs during the dialogue. A task-oriented bot therefore needs to help users clarify their goal by asking, clarifying, and confirming.

Task evaluation

First, there should be a clear goal for solving the user’s problem. The key evaluation criteria are that the number of turns should be as small as possible and that answers should be as direct as possible; irrelevant answers seriously hurt the user experience.

Task-oriented dialogue versus chit-chat

Ideepwise products:


AliMe products:


As can be seen from the above, the purpose of task-oriented dialogue is very clear: the key is to obtain the intent and the constraint conditions (slots) and to track the dialogue state.

Where task-oriented dialogue sits in the dialogue-system family


The classification here is as follows. First, dialogue is divided into question answering and conversation. Question answering is further divided by whether the underlying documents are structured. Unstructured documents cover IR-style information retrieval (e.g., matching QA pairs, finding the relevant document) and IE-style information extraction (e.g., reading comprehension, locating the precise answer span in a document); the difficulty in this part lies in computing similarity. Structured sources include databases and knowledge graphs, whose inputs are structured fragments; a database supports querying, while a knowledge graph supports both querying and reasoning. The real difficulty in this part is how to extract the constraints (slots) from natural language. Next we focus on conversation, which can be divided into chit-chat, task-oriented dialogue, and so on. The traditional task-oriented pipeline consists of the language understanding module (SLU), the dialogue management module (DM), and the natural language generation module (NLG). The rest of this article focuses on the joint model in the SLU module.

The representation of semantics is a hard problem in natural language processing, and this is equally true for task-oriented dialogue…

2. Semantic representation in task type

How to parse natural language into an appropriate semantic representation has always been a difficult problem. Here are three relevant kinds of semantic representation.

1. Distributed semantic representation

(Distributional semantics)

It mainly covers the word level and the sentence level: word-level representations such as Word2Vec, GloVe, ELMo, and FastText, and sentence-level representations such as Skip-Thoughts, Quick-Thoughts, and InferSent.

2. Semantic representation of the framework

(Frame semantics)

Act(slot, value) structures: for example, for a currency query, Inform(currency=RMB, …)

3. Model-theoretic semantic representation

(Model-theoretic semantics)

This is an interesting kind of representation; Paul Gochet’s book on the subject gives a demonstration.

For example, some words can be interpreted as operations between two sets.
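As a toy illustration (my own example, not from the original article): content words can denote sets of entities, and composition becomes a set operation:

```python
# Toy model-theoretic semantics: each word denotes a set of entities,
# and combining an adjective with a noun is set intersection.
red = {"red_ball", "red_car"}       # denotation of "red"
ball = {"red_ball", "blue_ball"}    # denotation of "ball"

red_ball = red & ball               # "red ball" = red ∩ ball
print(red_ball)                     # {'red_ball'}
```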


The traditional task-oriented algorithm framework was introduced in the previous article; here is a review of its diagram:


Here we begin expanding on the language understanding module with an example.

For a dialogue, the language understanding module needs to parse out: the domain (e.g., airline or hotel), the intent of each segment (e.g., buying a ticket or getting a refund), and the constraint information (slots) under each specific intent.


3. Language understanding module

The language understanding module mainly covers intent recognition and slot recognition. Intent recognition is essentially a classification problem and can be approached with rules, traditional machine learning algorithms (SVM), or deep learning algorithms (CNN, LSTM, RCNN, C-LSTM, FastText), etc. Intents can also shift during a conversation, which we do not cover here. Slot recognition is essentially a sequence labeling task and can be approached with rules (Phoenix Parser), traditional machine learning algorithms (DBN, SVM), or deep learning algorithms (LSTM, BI-RNN, BI-LSTM-CRF). Some people do not draw much of a distinction between slots and entities. Here is an example using BIO tags:

For example, in “Show flights from Boston to New York today”, both Boston and New York are labeled as city at the entity level, whereas at the slot level they are distinguished as departure city and destination city. In this sense, slot types are more fine-grained than entity types.
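A minimal sketch of the difference, with BIO tags (the label names below are illustrative, loosely following ATIS-style conventions):

```python
# "Show flights from Boston to New York today" under two labeling schemes.
tokens = ["Show", "flights", "from", "Boston", "to", "New", "York", "today"]

# Entity view: both cities get the same label.
entity_tags = ["O", "O", "O", "B-city", "O", "B-city", "I-city", "O"]

# Slot view: the same spans get role-specific labels.
slot_tags = ["O", "O", "O", "B-fromloc.city", "O",
             "B-toloc.city", "I-toloc.city", "B-depart_date"]

for tok, ent, slot in zip(tokens, entity_tags, slot_tags):
    print(f"{tok:8s} {ent:8s} {slot}")
```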


Joint Model (Intent+Slot)

1. The first paper mainly uses a bidirectional GRU + CRF as the joint model of intent and slot.

Zhang X, Wang H. A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding[C]. IJCAI, 2016.

The model is as follows:


1. The input is windowed word vectors:


2. High-dimensional features are learned using a bidirectional GRU.

3. Intent and slot

For intent classification, max pooling over the hidden features learned at each time step yields a representation of the whole sentence, and softmax is then used to classify the intent:



For slots, the hidden state at each step is passed through a forward network to obtain the probability of each label, and a CRF scores the sequence globally to obtain the optimal label sequence.


The joint loss function is the maximum likelihood over both the slot sequence and the intent:


The model in this paper is simple, and both slot recognition and intent recognition reach high accuracy.
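A minimal sketch of this architecture, assuming PyTorch and the third-party pytorch-crf package (layer sizes and names are illustrative, not the paper’s):

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

class JointBiGRU(nn.Module):
    """BiGRU encoder; max-pooled softmax head for intent, CRF head for slots."""
    def __init__(self, vocab_size, emb_dim, hidden, n_intents, n_slots):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.intent_head = nn.Linear(2 * hidden, n_intents)
        self.slot_head = nn.Linear(2 * hidden, n_slots)  # per-step label scores
        self.crf = CRF(n_slots, batch_first=True)

    def forward(self, tokens, slot_tags=None, intent=None):
        h, _ = self.gru(self.emb(tokens))                       # (B, T, 2*hidden)
        intent_logits = self.intent_head(h.max(dim=1).values)   # max pooling
        emissions = self.slot_head(h)
        if slot_tags is not None:  # training: joint (negative) log-likelihood
            slot_nll = -self.crf(emissions, slot_tags)
            intent_nll = nn.functional.cross_entropy(intent_logits, intent)
            return slot_nll + intent_nll
        return intent_logits.argmax(-1), self.crf.decode(emissions)

model = JointBiGRU(vocab_size=1000, emb_dim=64, hidden=64, n_intents=5, n_slots=9)
toks = torch.randint(0, 1000, (2, 8))
tags = torch.randint(0, 9, (2, 8))
intents = torch.randint(0, 5, (2,))
loss = model(toks, tags, intents)
loss.backward()
```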

The results of this paper are based on the ATIS dataset:

Intent: 98.32 Slot (F1): 96.89

2. The second paper mainly uses the semantic parse tree to construct path features for a joint model of slot and intent recognition (RecNN + Viterbi).

Guo D, Tur G, Yih W, et al. Joint Semantic Utterance Classification and Slot Filling with Recursive Neural Networks[C]. IEEE SLT, 2014.

First, the paper’s basic Recursive NN model is introduced:


The input is individual word vectors (the optimized version uses windowed word vectors), and each syntactic tag is treated as a weight vector. The operation at each node along a word’s path is then a simple product between the word vector and the tag’s weight vector: for example, each square in the figure above is the product of a tag’s weight vector with its input vector. When a parent node has several child branches, its value can be taken as the sum of the children’s weighted products.

Intent recognition module

This paper directly uses the output vector of the root node to make the classification.

Slot recognition

This module introduces path-vector features:


For example, the path of the word “in” in the semantic parse tree is “in – PP – NP”. The output vectors along the path are weighted to obtain the path feature. The paper concatenates the path features of three consecutive words as a tri-path feature for slot classification, and thereby predicts the label of “in”.
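A rough NumPy sketch of the path-feature idea (the elementwise weighting and the uniform path weights below are simplifying assumptions, not the paper’s exact formulation):

```python
import numpy as np

dim = 8
rng = np.random.default_rng(0)

# One learned weight vector per syntactic tag (elementwise weighting assumed).
tag_weights = {t: rng.standard_normal(dim) for t in ["PP", "NP"]}

def path_vectors(word_vec, path_tags):
    """Vectors produced along a word's path, e.g. "in" -> PP -> NP."""
    vecs, v = [word_vec], word_vec
    for tag in path_tags:
        v = tag_weights[tag] * v   # node output: tag weight * child vector
        vecs.append(v)
    return vecs

def path_feature(word_vec, path_tags, weights=None):
    vecs = path_vectors(word_vec, path_tags)
    w = weights or [1.0 / len(vecs)] * len(vecs)  # learned weights in the paper
    return sum(wi * vi for wi, vi in zip(w, vecs))

word_in = rng.standard_normal(dim)
# Tri-path feature: concatenate the path features of three consecutive words.
tri_path = np.concatenate([path_feature(rng.standard_normal(dim), ["PP", "NP"]),
                           path_feature(word_in, ["PP", "NP"]),
                           path_feature(rng.standard_normal(dim), ["PP", "NP"])])
print(tri_path.shape)  # (24,)
```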


Optimizations

The paper also makes some optimizations on top of the baseline:

The input is optimized to windowed word vectors.


Unlike the simple weighting network above, each node adopts a nonlinear activation function.


Viterbi-based global optimization as in a CRF is adopted, together with maximizing the label sequence under a tri-gram language model.


The results of this paper are based on the ATIS dataset:

Intent: 95.40 Slot (F1): 93.96

3. The third paper is mainly based on the CNN + Tri-CRF model.

Xu P, Sarikaya R. Convolutional Neural Network Based Triangular CRF for Joint Intent Detection and Slot Filling[C]. IEEE ASRU, 2013.

The CNN + Tri-CRF model looks as follows:


The model for slot recognition

The input is the word vector at each position; a convolution layer produces the high-dimensional features h, which the Tri-CRF then scores as a whole. The Tri-CRF differs from a linear-chain CRF in that each feature first passes through a forward network to obtain per-label scores before entering the CRF. Let’s analyze the scoring formula:


Here t(y_{i-1}, y_i) is the transition score and h_i is the high-dimensional feature obtained by the CNN; the feature at each time step passes through a forward network to obtain the probability of each tag, so combining the two terms gives the overall sequence score.
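The formula image is missing here; based on the prose description, a plausible reconstruction of the sequence score is (my notation, not a verbatim copy from the paper):

```latex
% Score of a label sequence y_1..y_n given CNN features h_1..h_n:
% transition scores plus per-step label scores from a forward network f.
p(\mathbf{y} \mid \mathbf{h}) =
  \frac{\exp\!\Big( \sum_{i=1}^{n} \big[\, t(y_{i-1}, y_i) + f(y_i, h_i) \,\big] \Big)}
       {\sum_{\mathbf{y}'} \exp\!\Big( \sum_{i=1}^{n} \big[\, t(y'_{i-1}, y'_i) + f(y'_i, h_i) \,\big] \Big)}
```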

For intent recognition

The CNN uses the same set of parameters to obtain the high-dimensional features h of each position; max pooling over them yields the whole-sentence representation, and softmax yields the intent classification.


Combining the above is in effect a joint model.
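A minimal PyTorch sketch of the parameter sharing (illustrative sizes; the Tri-CRF layer itself is omitted, and only the per-step label scores it would consume are produced):

```python
import torch
import torch.nn as nn

class CNNJoint(nn.Module):
    """Shared 1-D convolution features; per-step slot scores + pooled intent."""
    def __init__(self, vocab_size, emb_dim, channels, n_intents, n_slots):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)
        self.slot_head = nn.Linear(channels, n_slots)    # feeds the Tri-CRF
        self.intent_head = nn.Linear(channels, n_intents)

    def forward(self, tokens):
        x = self.emb(tokens).transpose(1, 2)               # (B, emb_dim, T)
        h = torch.relu(self.conv(x)).transpose(1, 2)       # shared features (B, T, C)
        slot_scores = self.slot_head(h)                    # per-step label scores
        intent_logits = self.intent_head(h.max(dim=1).values)  # max pooling
        return intent_logits, slot_scores

m = CNNJoint(1000, 64, 128, 5, 9)
intent_logits, slot_scores = m(torch.randint(0, 1000, (2, 8)))
print(intent_logits.shape, slot_scores.shape)  # (2, 5) and (2, 8, 9)
```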

The results of this paper are based on the ATIS dataset:

Intent: 94.09 Slot (F1): 95.42

4. The fourth paper is mainly based on attention-based RNNs.

Liu B, Lane I. Attention-based recurrent neural network models for joint intent detection and slot filling[J]. 2016.

Let’s start with the concept of the context vector; see Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate[J]. 2014.


The g in the formula above is actually a forward network that computes the correlation between a decoder hidden state and each encoder hidden state of the input sequence, i.e., the attention weights. Weighting the encoder hidden outputs of every time step by the current step’s attention yields the context vector.
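The formula image is missing here; the standard Bahdanau-style definition being referenced is:

```latex
e_{ij} = g(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T} \alpha_{ij}\, h_j
```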

Moving on to the main part: the encoder-decoder model used in this paper is as follows:


Slot

Figure (a) shows the attention model without aligned inputs: the decoder hidden states are not aligned with the encoder. The input of each decoder cell is the previous hidden state s, the label probability of the previous step, and the context vector c.

Figure (b) shows the aligned model without attention. The input of each decoder cell is the previous hidden state s, the label probability of the previous step, and the output of the corresponding encoder hidden state at each step.

Figure (c) shows the aligned model with attention. The input of each decoder cell is the previous hidden state s, the label probability of the previous step, the output of the corresponding encoder hidden state, and the context vector c.

Intent

Intent classification uses the final output of the encoder together with the context vector.

Results of this model on the ATIS dataset (with aligned inputs):

Intent: 94.14 Slot (F1): 95.62

Building on the above idea, the paper derives another attention-based RNN joint model:


The input of the BiRNN hidden layer is:


Slot

The high-dimensional features from the BiRNN are concatenated with the context vector and used as the input of a single-layer decoder RNN for slot recognition. Note that the label output probabilities are fed back only to the forward layer of the BiRNN.

Intent

The hidden outputs of the single-layer decoder RNN are weighted to obtain a final output vector, which yields the intent classification.
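A compact PyTorch sketch of this variant (a single shared attention and no label feedback, both simplifications relative to the paper):

```python
import torch
import torch.nn as nn

class AttnRNNJoint(nn.Module):
    """BiLSTM encoder; attention context concatenated for slots,
    attention-weighted sum of hidden states for the intent."""
    def __init__(self, vocab_size, emb_dim, hidden, n_intents, n_slots):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)              # simplified scoring network g
        self.slot_head = nn.Linear(4 * hidden, n_slots)   # consumes [h_t ; c]
        self.intent_head = nn.Linear(2 * hidden, n_intents)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))            # (B, T, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)       # attention weights (B, T, 1)
        c = (a * h).sum(dim=1, keepdim=True)         # context vector (B, 1, 2*hidden)
        slot_in = torch.cat([h, c.expand_as(h)], dim=-1)
        slot_logits = self.slot_head(slot_in)        # per-step slot scores
        intent_logits = self.intent_head(c.squeeze(1))  # weighted sum of states
        return intent_logits, slot_logits

m = AttnRNNJoint(1000, 64, 64, 5, 9)
intent_logits, slot_logits = m(torch.randint(0, 1000, (2, 8)))
```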

Results of this model on the ATIS dataset (with aligned inputs):

Intent: 94.40 Slot (F1): 95.78

5. The fifth paper presents an online joint model of intent, slot, and language modeling. The main reason the four joint models above cannot run online is that they all analyze the whole sentence as a unit and cannot parse in real time. The highlight of this paper is real-time analysis: for the input received up to time t, it infers the best intent and slots so far and predicts the next word.

Liu B, Lane I. Joint online spoken language understanding and language modeling with recurrent neural networks[J]. 2016.


The figure above shows the parsing at the current time t:

Intent, slot, and next word

Let w be the word sequence up to time t (inclusive), c the intent sequence before t, and s the slot sequence before t. These three serve as the input of the RNN at the current time t; from the RNN hidden output, separate MLP layers classify the intent and the slot at time t. At the same time, the hidden output concatenated with the intent and slot information is fed into another MLP layer to predict the next word.


In actual operation, an LSTM takes as input the word, intent, and slot information from the previous step; IntentDist, SlotLabelDist, and WordDist in the formulas are MLP layers.
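A step-wise PyTorch sketch of this recurrence (the conditioning details and greedy decisions are simplified assumptions):

```python
import torch
import torch.nn as nn

class OnlineJointSLU(nn.Module):
    """One LSTM step per token, conditioned on the previous intent/slot;
    three MLP heads: intent, slot label, next word."""
    def __init__(self, vocab, emb, hidden, n_intents, n_slots):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, emb)
        self.intent_emb = nn.Embedding(n_intents, emb)
        self.slot_emb = nn.Embedding(n_slots, emb)
        self.cell = nn.LSTMCell(3 * emb, hidden)
        self.intent_dist = nn.Linear(hidden, n_intents)      # IntentDist
        self.slot_dist = nn.Linear(hidden, n_slots)          # SlotLabelDist
        self.word_dist = nn.Linear(hidden + 2 * emb, vocab)  # WordDist

    def step(self, word, prev_intent, prev_slot, state):
        x = torch.cat([self.word_emb(word), self.intent_emb(prev_intent),
                       self.slot_emb(prev_slot)], dim=-1)
        h, c = self.cell(x, state)
        intent = self.intent_dist(h).argmax(-1)   # greedy decision, as in the paper
        slot = self.slot_dist(h).argmax(-1)
        # Next-word prediction conditioned on the current intent and slot.
        word_logits = self.word_dist(
            torch.cat([h, self.intent_emb(intent), self.slot_emb(slot)], dim=-1))
        return intent, slot, word_logits, (h, c)

m = OnlineJointSLU(1000, 32, 64, 5, 9)
state = (torch.zeros(1, 64), torch.zeros(1, 64))
intent = torch.zeros(1, dtype=torch.long)
slot = torch.zeros(1, dtype=torch.long)
for w in torch.randint(0, 1000, (4, 1)):        # feed tokens one at a time
    intent, slot, word_logits, state = m.step(w, intent, slot, state)
```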


The training method is maximum likelihood over the three modules above:


Note that, because this is an online algorithm, the paper uses a greedy strategy: the current optimum is reached on the basis of the previous intents and slots.


The scores of the above models on ATIS:


4. Summary

The models above use deep learning to tackle intent and slot recognition, the key problems in the traditional task-oriented framework, and in practice they can be applied to a specific task-oriented domain (I have used an LSTM+CRF approach in a project to extract exchange-rate slots). How the language understanding module should cooperate with the dialogue management module to handle multi-turn dialogue remains a difficult and rather painful problem: although some traditional models and, for example, reinforcement learning methods have been proposed within the traditional framework, issues such as data annotation and unsmooth, rigid dialogue flows seriously affect the user experience in task-oriented dialogue. Recently, task-oriented models based on sequence-to-sequence architectures combined with knowledge bases have produced some unexpected surprises; the next article will mainly introduce such models.

Reference:

[1] Zhang X, Wang H. A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding[C]. IJCAI, 2016.

[2] Guo D, Tur G, Yih W, et al. Joint Semantic Utterance Classification and Slot Filling with Recursive Neural Networks[C]. IEEE SLT, 2014.

[3] Xu P, Sarikaya R. Convolutional Neural Network Based Triangular CRF for Joint Intent Detection and Slot Filling[C]. IEEE ASRU, 2013.

[4] Liu B, Lane I. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling[J]. 2016.

[5] Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate[J]. 2014.

[6] Liu B, Lane I. Joint Online Spoken Language Understanding and Language Modeling with Recurrent Neural Networks[J]. 2016.