In an era of fragmented reading, fewer and fewer people pay attention to the exploration and thinking behind each paper.
In this column, you will quickly get the highlights and pain points of each selected paper and keep up with the cutting-edge AI achievements.
This is the 11th issue of PaperDaily.
This issue of paper notes is an NLP special brought to you by PaperWeekly community users @JeffreyGao, @RobertDlut and @LC222, covering dialogue systems, named entity recognition (NER) and QA systems.
If a paper catches your interest, copy its link into your browser to read the original.
Dialogue system
Paper | Affective Neural Response Generation
Link | http://www.paperweekly.site/papers/1043
Author | JeffreyGao
1. Motivation
The paper is from Huawei Noah’s Ark Laboratory.
It is often said that artificial intelligence should have emotions and be able to sense people’s joys and sorrows, so today I will introduce a chatbot with emotions. In previous studies, most dialogue systems focus only on whether the syntax and semantics of the generated responses are reasonable, which includes considering the context, sticking to the topic, generating longer sentences, and so on. Very few conversational systems pay attention to emotion, which is quite unreasonable, because in conversation, when one person expresses sadness, the other usually responds with appropriate comfort, and when one is happy, the other is happy for them. For example, when person A says, “My dog passed away,” it is natural for person B to respond with something like, “I’m so sorry,” and our daily conversations are full of such emotional interaction.
In fact, from my personal point of view, this is also a promising research direction beyond open-domain dialogue: there are already more than enough papers on how to avoid meaningless responses like “hehe,” and only a chatbot that truly simulates human conversation is a good chatbot.
2. Related work
In this paper, the author mentions two pieces of related work: one is the Affect-LM language model from ACL 2017 [Ghosh et al. 2017], and the other is the Emotional Chatting Machine (ECM) [Zhou et al. 2017], which, judging from the arXiv version, appears to have been submitted to AAAI 2018.
The ECM model is quite nice, but the problem setting is poorly defined. Where is the issue? The model is given the context together with the emotion that the response is supposed to express, and then generates a response carrying that emotion. But in everyday conversation, nobody hands you in advance the emotion your response should carry.
3. Model
The model in this paper takes seq2seq as its backbone and makes three main improvements: (1) adding emotional information to the word embeddings; (2) improving the loss function; (3) taking emotion into account during beam search.
3.1 Word vector with emotion
The VAD lexicon is used here, explained as follows: “Valence (V, the pleasantness of a stimulus), Arousal (A, the intensity of emotion produced, or the degree of arousal evoked, by a stimulus), and Dominance (D, the degree of power/control exerted by a stimulus). V scores of 1, 5 and 9 correspond to a word being very negative (e.g., pedophile), neutral (e.g., tablecloth) and very positive (e.g., happiness), respectively. A scores of 1, 5 and 9 correspond to a word having very low (e.g., dull), moderate (e.g., watchdog), and very high (e.g., insanity) emotional intensity, respectively. D scores of 1, 5 and 9 correspond to a word/stimulus that is very powerless (e.g., dementia), neutral (e.g., waterfall) and very powerful (e.g., paradise), respectively.”
To put it simply, someone has already built and rated an affective lexicon: each word gets a three-dimensional vector, one dimension per affective aspect, written as (V_score, A_score, D_score). If a word in the training set appears in the VAD lexicon, its vector is taken from the lexicon; otherwise [5, 1, 5] is used as the neutral value.
The specific approach in this paper is to concatenate the conventional word vector with the emotion vector (the paper denotes the result W2VA) and use it as the input of both the encoder and the decoder. This step is essentially the same as the first step of the Emotional Chatting Machine, except that in ECM the emotion vector is learned rather than taken from an external lexicon.
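To make this concrete, here is a minimal sketch of the concatenation (the lexicon entries, scores and dimensions below are made up for illustration; real V/A/D scores come from the external VAD lexicon, and this is not the authors’ code):

```python
import numpy as np

# Illustrative entries only -- real V/A/D scores come from the external VAD lexicon.
vad_lexicon = {
    "happiness": np.array([9.0, 6.5, 7.0]),
    "dull":      np.array([4.0, 1.7, 4.0]),
}
NEUTRAL_VAD = np.array([5.0, 1.0, 5.0])  # fallback for words missing from the lexicon

def affective_embedding(word, word2vec):
    """Concatenate a conventional word vector with its 3-d VAD vector (the W2VA idea)."""
    vad = vad_lexicon.get(word, NEUTRAL_VAD)
    return np.concatenate([word2vec[word], vad])

# usage sketch: word2vec is any mapping from word to a dense vector
word2vec = {"happiness": np.random.randn(300), "dull": np.random.randn(300)}
print(affective_embedding("happiness", word2vec).shape)  # (303,)
```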
3.2 Objective function with emotion
The author proposes three different loss functions here, which, in my opinion, is a matter of necessity: there is no single obviously better way to integrate emotion into the objective.
3.2.1 Minimize affective dissonance
In this loss function, the author’s assumption is that when two people chat, their emotions do not change too quickly or too frequently: if you say something friendly, I politely reply with something friendly; if you say something aggressive, I reply with something angry. Naturally, the loss function should then consider not only cross entropy but also how close the emotion of the generated response is to the emotion of the input, measured by Euclidean distance, as follows:
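The formula itself appears as an image in the original post; as a rough sketch in my own notation (not necessarily the paper’s exact form), the objective adds to the cross entropy a term that penalises the Euclidean distance between the average source emotion and the average emotion of the words generated so far:

$$
L_{\mathrm{DMIN}} = \sum_{t}\Big[-\log p(y_t \mid y_{<t}, X) + \lambda\,\Big\|\frac{1}{|X|}\sum_{w\in X}\mathrm{VAD}(w) - \frac{1}{t}\sum_{\tau\le t}\mathrm{VAD}(y_\tau)\Big\|_2\Big]
$$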
3.2.2 Maximize affective dissonance
Here the author makes the opposite hypothesis: for example, when two people who are not familiar with each other are chatting, if one is overly friendly, the other may actually be put off. In that case we try to make the two sentences as emotionally dissimilar as possible, simply by flipping the sign of the second term in the formula above. The paper denotes this loss L_DMAX.
3.2.3 Maximize emotional content
The idea here is to make the model produce clearly emotional sentences so as to avoid generating dull ones, without specifying whether the emotion should be positive or negative. Put plainly, it is yet another way of avoiding responses like “hehe” or “I don’t know.” The loss function is as follows:
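Again the formula is an image in the original post; sketched in the same notation as above (my reconstruction of the idea, not the paper’s exact form), each generated word is rewarded for being far from the neutral point [5, 1, 5]:

$$
L_{\mathrm{AC}} = \sum_{t}\Big[-\log p(y_t \mid y_{<t}, X) - \lambda\,\big\|\mathrm{VAD}(y_t) - [5,\,1,\,5]\big\|_2\Big]
$$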
3.3 Decoding with emotional diversity
This decoding builds on Diverse Beam Search: the top B sequences are divided into G groups, and an extra penalty term is added. The purpose of this penalty is to make what one group selects at the next step as different as possible from what the other groups select, thereby achieving diversity. Dissimilarity is measured at two levels, the emotion of a single word and the emotion of the whole sentence, with cosine similarity used as the measure.
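As a simplified illustration of the penalty (my own sketch, not the paper’s exact scoring function), a candidate is penalised in proportion to how emotionally similar it is to what the other groups have already chosen:

```python
import numpy as np

def emotional_diversity_penalty(candidate_vad, other_group_vads, weight=1.0):
    """Penalty for a beam candidate whose emotion (a 3-d VAD vector, either for a single
    word or averaged over a partial sentence) resembles the other groups' choices.
    Simplified illustration of the idea, not the paper's exact formula."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    if not other_group_vads:
        return 0.0
    # higher similarity to other groups -> larger penalty subtracted from the beam score
    return weight * float(np.mean([cosine(candidate_vad, v) for v in other_group_vads]))
```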
4. Experiment
The paper uses the Cornell Movie Dialogs corpus. Instead of metrics such as BLEU, ROUGE or METEOR, human evaluation is used. I can understand this choice, because metrics like BLEU carry little meaning for this kind of dialogue and may not reflect quality as well as human judgement. Note that for the methods with an affective loss function, 40 epochs are first trained with the ordinary cross-entropy objective and only then are 10 epochs trained with the specific affective loss; the paper mentions that if the affective loss is used from the start, the syntax of the generated answers is poor.
5. Impressions of the model
In general, the paper adds emotion to seq2seq at all three essential stages. However, the method still looks a little forced: there is no explicit modelling of emotional interaction. Still, it is a worthwhile first attempt at emotion in dialogue systems.
6. References
[Ghosh et al. 2017] Ghosh, S.; Chollet, M.; Laksana, E.; Morency, L.-P.; and Scherer, S. 2017. Affect-LM: A neural language model for customizable affective text generation. In ACL, 634–642.
[Zhou et al. 2017] Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; and Liu, B. 2017. Emotional chatting machine: Emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074.
Named entity recognition
Paper | End-to-End Sequence Labeling via Bi-directional LSTM-CNNs-CRF
Link | http://www.paperweekly.site/papers/367
Author | robertdlut
In recent years, deep learning methods based on neural networks have made great progress in natural language processing. Named entity recognition (NER), a basic NLP task, is no exception, and neural architectures have achieved good results on it as well. Around the time of this paper, many similar NN-CRF architectures for NER appeared, and they have become the current mainstream NER models, achieving good results. Here is a summary, shared for learning together.
1. Introduction
Named entity recognition (NER) is the task of identifying the relevant entities in a piece of natural-language text and labeling their positions and types, as shown in the figure below. It is the basis of more complex NLP tasks such as relation extraction and information retrieval, and it has long been a research hotspot in NLP. From early dictionary- and rule-based methods, through traditional machine learning methods, to deep-learning-based methods in recent years, the general trend of NER research is shown in the figure below.
In machine-learning-based approaches, NER is treated as a sequence labeling problem. Compared with a classification problem, the tag predicted at the current position depends not only on the current input features but also on the previously predicted tags; that is, there are strong dependencies within the predicted tag sequence. For example, when the BIO scheme is used for NER, a valid tag sequence never has an I tag immediately following an O tag.
In traditional machine learning, the Conditional Random Field (CRF) is the mainstream NER model. Its objective function considers not only the input state feature functions but also the label transition feature functions. Model parameters can be learned with SGD during training. Once the model is known, finding the output sequence that maximizes the objective for a given input is a dynamic programming problem that can be decoded with the Viterbi algorithm. However, CRFs rely on feature engineering, and commonly used features are as follows:
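Since Viterbi decoding recurs throughout the models below, here is a minimal sketch of it over per-token tag scores plus tag-transition scores (an illustrative implementation, not tied to any particular paper’s code):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   (seq_len, num_tags) per-token tag scores
    transitions: (num_tags, num_tags) score of moving from tag i to tag j
    """
    seq_len, num_tags = emissions.shape
    score = np.full((seq_len, num_tags), -np.inf)   # best score of any path ending in tag j at step t
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    score[0] = emissions[0]
    for t in range(1, seq_len):
        # candidate[i, j] = best path ending in tag i at t-1, then transitioning to tag j
        candidate = score[t - 1][:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score[t] = candidate.max(axis=0)
    best = [int(score[-1].argmax())]
    for t in range(seq_len - 1, 0, -1):             # follow back-pointers from the best final tag
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```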
Next we will focus on how to use neural network architecture to perform NER.
2. Mainstream neural network architectures for NER
2.1 NN/CNN-CRF model
In fact, even earlier than this paper, “Natural Language Processing (Almost) from Scratch” had already used neural networks for NER.
In that paper, the authors propose two network structures for NER: a window approach and a sentence approach. The main difference between the two is that the window approach uses only the context window of the current word as input, followed by a traditional feed-forward NN, while the sentence approach takes the whole sentence as the input for the current word, adds relative position features to distinguish each word in the sentence, and then uses one layer of convolutional neural network (CNN).
For training, the authors also give two objective functions: one is the word-level log-likelihood, i.e., using softmax to predict tag probabilities and treating it as a traditional classification problem; the other is the sentence-level log-likelihood, which, considering the advantages of CRF models in sequence labeling, adds label transition scores to the objective. Much later work referred to this idea as adding a CRF layer, so I call it the NN/CNN-CRF model here.
In the authors’ experiments, the NN and CNN structures above perform roughly the same, but adding the CRF layer, i.e., the sentence-level likelihood, improves NER performance significantly.
2.2 RNN-CRF model
Following the CRF idea above, a series of NER works combining an RNN structure with a CRF layer appeared around 2015, including this paper. The representative work includes:
All of this work can be summarized as an RNN-CRF model, whose structure is shown below:
It mainly consists of an embedding layer (word vectors plus character vectors; this paper uses a CNN for the character vectors while others use an RNN, with similar results), a bidirectional RNN layer, a tanh hidden layer, and finally a CRF layer. The main difference from the NN/CNN-CRF model is that the NN/CNN is replaced by a bidirectional RNN, usually an LSTM or GRU.
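As a rough sketch of the emission side of such a model (hypothetical layer sizes; the character-vector branch and the CRF training loss are omitted, and this is not any particular paper’s code):

```python
import torch
import torch.nn as nn

class BiRNNEmissions(nn.Module):
    """Word embedding -> BiLSTM -> tanh hidden layer -> per-token tag scores for a CRF layer."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # word vectors (char vectors omitted)
        self.birnn = nn.LSTM(emb_dim, hidden_dim // 2,
                             bidirectional=True, batch_first=True)
        self.hidden = nn.Linear(hidden_dim, hidden_dim)       # tanh hidden layer
        self.to_tags = nn.Linear(hidden_dim, num_tags)        # emission scores fed to the CRF
        # tag-transition scores of the CRF layer, learned jointly with the network
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))

    def forward(self, token_ids):                             # token_ids: (batch, seq_len)
        h, _ = self.birnn(self.embed(token_ids))
        return self.to_tags(torch.tanh(self.hidden(h)))       # (batch, seq_len, num_tags)
```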
Experimental results show that RNN-CRF achieves better results, matching or exceeding CRF models based on rich features, and it has become the most mainstream deep-learning-based NER model. In terms of features, the model inherits the advantages of deep learning: no feature engineering is needed, word vectors and character vectors alone achieve good results, and high-quality dictionary features, if available, can improve performance further.
3. Some recent work
In the past year or so, NER research based on neural architectures has mainly focused on two directions: one is using the popular attention mechanism to improve model performance, and the other is work on training with only small amounts of labeled data.
3.1 Attention-based
The paper “Attending to Characters in Neural Sequence Labeling Models” improves how word vectors and character vectors are combined, on top of the RNN-CRF model structure. Instead of simply concatenating the character vector and the word vector, it uses an attention mechanism to compute a weighted sum, learning the attention weights with two hidden layers of a traditional neural network, so that the model can dynamically combine word-vector and character-vector information. Experimental results show this beats the original concatenation.
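A minimal sketch of this gating idea (dimensions and layer shapes are illustrative, not taken from the paper’s code):

```python
import torch
import torch.nn as nn

class WordCharGate(nn.Module):
    """Learn a per-dimension weight z and combine word and character vectors as a weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.Tanh(),   # hidden layer over [word; char]
            nn.Linear(dim, dim), nn.Sigmoid(),    # z in (0, 1), one weight per dimension
        )

    def forward(self, word_vec, char_vec):
        z = self.att(torch.cat([word_vec, char_vec], dim=-1))
        return z * word_vec + (1 - z) * char_vec  # weighted sum instead of concatenation
```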
Another paper, “Phonologically Aware Neural Model for Named Entity Recognition in Low Resource Transfer Settings,” adds phonological features on top of the original BiLSTM-CRF model and applies an attention mechanism over the character vectors to learn to attend to the more informative characters. The main improvements are shown in the figure below.
3.2 Small amounts of labeled data
Deep learning methods generally require large amounts of labeled data, but some domains simply do not have them. How to make neural-network-based NER work with only a small amount of labeled data has therefore become a focus of recent research, mainly through transfer learning (“Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks”) and semi-supervised learning.
“Semi-supervised sequence tagging with bidirectional language models” is a paper recently accepted at ACL 2017. It trains a bidirectional neural language model on a massive unlabeled corpus, then uses the trained language model to obtain an LM embedding for each word to be tagged, and adds that embedding as an extra feature to the original bidirectional RNN-CRF model.
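The fusion step itself is simple; as a sketch (where exactly the LM embedding is concatenated is a design choice, and this is not the paper’s code):

```python
import torch

def add_lm_features(token_repr, lm_embedding):
    """Concatenate the frozen, pretrained bi-LM's hidden states onto the tagger's own
    token representations, so they act as extra per-token features."""
    # token_repr:   (batch, seq_len, d_tagger)  from the tagger's embedding/RNN layer
    # lm_embedding: (batch, seq_len, d_lm)      computed once under torch.no_grad()
    return torch.cat([token_repr, lm_embedding], dim=-1)
```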
Experimental results show that adding this language-model embedding greatly improves NER performance when only a small amount of labeled data is available, and it still improves over the original RNN-CRF model when a large amount of labeled training data is used. The overall model structure is shown below:
4. Summary
I will not go through the experimental sections of these papers; to wrap up: NN/CNN/RNN-CRF models, which combine a neural network with a CRF layer, have become the mainstream NER models. In my opinion, neither CNN nor RNN has an absolute advantage; each has its own strengths. Because RNNs have a natural sequential structure, however, RNN-CRF is the more widely used.
NER methods based on neural architectures inherit the advantages of deep learning: without extensive hand-crafted features, word vectors and character vectors alone reach the mainstream level, and adding high-quality dictionary features can further improve the results. For the problem of small labeled training sets, transfer learning and semi-supervised learning should be the focus of future research.
5. References
[1] Lafferty J, McCallum A, Pereira F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the eighteenth international conference on machine learning, ICML. 2001, 1: 282-289.
[2] Sutton C, McCallum A. An Introduction to Conditional Random Fields. Foundations and Trends® in Machine Learning, 2012, 4(4): 267-373.
[3] Collobert R, Weston J, Bottou L, et al. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011, 12(Aug): 2493-2537.
[4] Lample G, Ballesteros M, Subramanian S, et al. Neural Architectures for Named Entity Recognition. Proceedings of NAACL-HLT. 2016: 260-270.
[5] Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
[6] Ma X, Hovy E. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354, 2016.
[7] Chiu J P C, Nichols E. Named Entity Recognition with Bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308, 2015.
[8] Rei M, Crichton G K O, Pyysalo S. Attending to Characters in Neural Sequence Labeling Models. arXiv preprint arXiv:1611.04361, 2016.
[9] Akash Bharadwaj, David Mortensen, Chris Dyer, Jaime G Carbonell. Phonologically aware neural model for named entity recognition in low resource transfer settings. EMNLP, pages 1462–1472, 2016.
[10] Yang Z, Salakhutdinov R, Cohen W W. Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks. ICLR, 2017.
[11] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. Semi-supervised sequence tagging with bidirectional language models. ACL, 2017.
QA systems
Paper | Gated End-to-End Memory Networks
Link | http://www.paperweekly.site/papers/1073
Author | lc222
This paper makes some modifications to End-To-End Memory Networks. Because End-To-End Memory Networks do not perform very well on multi-fact QA, positional reasoning, dialogue and other tasks, this paper borrows the shortcut-connection idea of Highway Networks and Residual Networks from the CV field and introduces a gating mechanism over the memory, allowing the model to dynamically control how the memory is incorporated.
Since End-To-End Memory Networks should already be familiar, let us first introduce the idea of Highway Networks. It introduces a transform gate T and a carry gate C before a layer’s output is passed on, letting the network learn what information, and how much of it, should be transmitted to the next layer. Suppose the output of this layer of the network is y = H(x); we then add the following mapping:
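Written out (the standard Highway Networks formulation; the original post shows it as an image):

$$
y = H(x, W_H) \odot T(x, W_T) + x \odot C(x, W_C)
$$

where T and C are gates with outputs in (0, 1), typically T(x, W_T) = \sigma(W_T x + b_T).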
Usually we choose C = 1 − T, so the formula above becomes:
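Again in the standard form (shown as an image in the original post):

$$
y = H(x, W_H) \odot T(x, W_T) + x \odot \bigl(1 - T(x, W_T)\bigr)
$$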
A residual network can be regarded as a special case of a Highway network, obtained by fixing both T and C to the identity, which gives y = H(x) + x. I still have not fully understood the principle behind why this makes deeper networks so much easier to train; I will look at the relevant papers when I have time.
Now let us see how to integrate this into End-To-End Memory Networks. Since the function of each hop can be regarded as u' = H(u), u plays the role of the input x in the formula above and o plays the role of the transformed output H(x), so substituting into the formula gives:
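In notation close to the paper’s (to the best of my recollection; the formula is an image in the original post):

$$
T^{k}(u^{k}) = \sigma\bigl(W_T^{k} u^{k} + b_T^{k}\bigr), \qquad
u^{k+1} = o^{k} \odot T^{k}(u^{k}) + u^{k} \odot \bigl(1 - T^{k}(u^{k})\bigr)
$$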
That is, the output formula of the original model is modified. The parameters W and b can be handled in two ways, shared globally or kept independent for each hop, and the later experiments show that keeping them independent per hop works better. The novelty of the paper is not huge, essentially a combination of the two earlier ideas, but the experimental gains appear to be substantial. The final model architecture is shown below:
Experimental results
The proposed model performs well not only on the bAbI dataset but also on the Dialog bAbI dialogue dataset. That dataset should come up in a later article, so it will not be described here. Here are two figures with the results:
The second figure shows the attention weight that MemN2N and the model proposed in this paper assign to each sentence at each hop. The proposed model concentrates more on the most important sentence, while MemN2N is comparatively scattered, which also indicates that the proposed model is better.
This paper is selected and recommended by PaperWeekly, an AI academic community covering natural language processing, computer vision, artificial intelligence, machine learning, data mining and information retrieval.