Overview: A question answering system is an advanced form of information retrieval. It can understand user questions posed in natural language more accurately and, by retrieving from a corpus, a knowledge graph, or a Q&A knowledge base, return concise and accurately matched answers. Compared with search engines, a question answering system can better understand the real intention behind a user's question and thus satisfy the user's information need more effectively. Question answering is a promising research direction in artificial intelligence and natural language processing.

1. Introduction

A question answering system mainly deals with users' questions and their answers. By the knowledge domain of the questions, question answering systems can be divided into closed-domain, open-domain, and Frequently Asked Questions (FAQ) systems. By the source of the answers, they can be divided into systems based on structured data (e.g., KBQA), systems based on free text (e.g., machine reading comprehension), and systems based on question-answer pairs (e.g., FAQ). In addition, by the mechanism used to produce the answer, they can be divided into retrieval-based and generation-based systems.

This article mainly describes the research background and processing framework of a retrieval-based FAQ question answering bot (FAQBot) and the application of deep learning in it. FAQ retrieval takes the user's new query, finds the most appropriate answer in the FAQ knowledge base, and returns it to the user, as shown in the figure:

Qi is the standard question in the knowledge base, and Ai is the corresponding answer to the standard question.

The specific processing process is as follows:

  • The candidate set is indexed offline. The Lucene engine builds a word-level inverted index over the tens of thousands of similar queries; Lucene keeps recall time at the millisecond level, greatly reducing the computational pressure on subsequent modules;
  • After a user query is received online, a batch of candidates is first recalled as the coarse-ranking result and passed to the next module for fine ranking;
  • A matching model computes the matching degree between the user query and the questions or answers in the FAQ knowledge base;
  • A ranking model reranks the candidate set and returns the top-k candidate answers. A minimal sketch of this recall-then-rerank flow is given after this list.
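The sketch below is a toy illustration only, not the production pipeline: the hypothetical `coarse_recall` function stands in for the Lucene recall (using a simple word-overlap score instead of the inverted index), and `match_score` stands for whichever matching model is trained downstream.

```python
# Hypothetical sketch of the recall-then-rerank FAQ pipeline.
# coarse_recall() stands in for the offline Lucene index lookup,
# match_score() for any trained matching model.

def coarse_recall(query, knowledge_base, top_n=30):
    """Cheap lexical recall: a toy word-overlap score in place of Lucene."""
    def overlap(q1, q2):
        w1, w2 = set(q1.split()), set(q2.split())
        return len(w1 & w2) / max(len(w1 | w2), 1)
    scored = [(overlap(query, item["question"]), item) for item in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_n]]

def answer(query, knowledge_base, match_score, top_k=3):
    candidates = coarse_recall(query, knowledge_base)            # coarse ranking
    reranked = sorted(candidates,
                      key=lambda c: match_score(query, c["question"]),
                      reverse=True)                              # fine ranking
    return [(c["question"], c["answer"]) for c in reranked[:top_k]]
```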

It can be seen that the core task of an FAQ system can be abstracted as a text matching task. Traditional text matching methods, such as BM25 in information retrieval and the vector space model (VSM), mainly address literal similarity. Because of the richness of Chinese semantics, however, it is often difficult to judge the semantic similarity of two sentences directly from keyword matching or shallow machine learning models. In recent years, approaches that use neural networks, especially deep learning models, to learn deep semantic features of text and then match on those semantic representations have been proposed and applied to retrieval-based question answering. On the one hand, deep learning models save the considerable manual effort of feature engineering. On the other hand, compared with traditional methods, deep text matching models can automatically learn relationships between words from large samples, combine phrase structure and hierarchical features of the text during matching, and mine implicit, non-obvious features from large amounts of data that traditional models struggle to capture, describing the text matching problem in a more fine-grained way.

2. Deep Learning Text Matching

An FAQ question answering system generally follows one of two approaches. The first is similar question matching: compare the similarity between the user's question and the questions already in the FAQ knowledge base, and return the answer of the best-matching question. This is essentially paraphrase identification. The second is question-answer matching: compare the matching degree between the user's question and the answers in the FAQ knowledge base, and return the best-matching answer. This is answer selection, i.e., QA matching. The two approaches are similar in that both can be treated as semantic text matching, and many models perform well on both tasks at the same time. The difference is that in QA matching the question and the answer are texts of different nature.

The following is a summary of some deep learning based text matching work, offered in the hope of sparking further discussion. If there are omissions or mistakes, additions and corrections are welcome.

2.1 Model Framework

In general, deep semantic matching models can be divided into two categories: representation-based methods and interaction-based methods.

1) Representation-based Method

The framework diagram is as follows:

These methods first use a deep learning model to represent each of the two objects to be matched, and then compute the similarity between the two representations to output the matching degree. The emphasis is on the construction of the representation layer, which must convert each object into a fixed-length semantic representation vector. The matching degree is then computed from the two semantic representation vectors. There are usually two ways to compute the matching function f(x, y), as shown in the figure below. One is a similarity measure (e.g., cosine similarity), which is most commonly used in practice; it is simple and efficient, with a bounded score range and a clear meaning. The other is to concatenate the two vectors and feed them into a multi-layer perceptron (MLP) that is trained to fit the matching score; this is more flexible and has stronger fitting ability, but also demands more from training.
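As a rough sketch of this framework (not any specific published model), the code below uses a mean-pooled embedding lookup as a stand-in for the encoder and shows both scoring options, cosine similarity and an MLP over the concatenated vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepresentationMatcher(nn.Module):
    """Siamese-style matcher: encode both texts separately, then score the two vectors.
    The mean-pooled embedding encoder is only a stand-in for a deeper encoder."""
    def __init__(self, vocab_size, emb_dim=128, use_mlp=False):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.use_mlp = use_mlp
        self.mlp = nn.Sequential(nn.Linear(2 * emb_dim, emb_dim),
                                 nn.ReLU(),
                                 nn.Linear(emb_dim, 1))

    def encode(self, ids):                        # ids: (batch, seq_len)
        return self.emb(ids).mean(dim=1)          # (batch, emb_dim)

    def forward(self, ids_x, ids_y):
        x, y = self.encode(ids_x), self.encode(ids_y)
        if self.use_mlp:                          # option 2: MLP on [x; y]
            return self.mlp(torch.cat([x, y], dim=-1)).squeeze(-1)
        return F.cosine_similarity(x, y, dim=-1)  # option 1: similarity measure
```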

Representation-based Method, Extended

The problem with the representation-based method above is that a single sentence-level representation is too coarse for accurate text matching. Inspired by information retrieval, where combining matching signals at the topic level and the word level often yields better performance, the sentence representation is extended with fine-grained matching information. The framework diagram is as follows:

2) Interaction-based Method

The framework diagram is as follows:

The interaction-based approach models text similarity through interaction. It emphasizes letting the two sentences to be matched interact fully and then matching on top of that interaction. At the representation layer the sentence is not converted into a single overall vector; instead, a set of vectors, one per word position, is kept. First, the representation layer produces these word-position vectors, either directly from word embeddings or from a DNN; each vector reflects some global information centered on that word. Then the two sentences interact word by word, building a matching pattern between the two texts that contains detailed, local interaction information. On top of this matching matrix, a DNN can extract higher-level matching features, and finally the matching score is computed. Interaction-based methods model the matching process in a finer and fuller way and generally achieve better results, but at higher computational cost. They are therefore better suited to scenarios that demand high accuracy but place looser constraints on computational performance.
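As a rough sketch of the interaction-based framework (in the spirit of MatchPyramid, with illustrative layer sizes), the code below builds a word-by-word dot-product interaction matrix and extracts matching features from it with a small CNN:

```python
import torch
import torch.nn as nn

class InteractionMatcher(nn.Module):
    """Build a word-by-word interaction matrix, then extract matching
    features from it with a small CNN before scoring."""
    def __init__(self, vocab_size, emb_dim=128, channels=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool2d((4, 4))
        self.out = nn.Linear(channels * 4 * 4, 1)

    def forward(self, ids_x, ids_y):
        x = self.emb(ids_x)                        # (batch, len_x, emb_dim)
        y = self.emb(ids_y)                        # (batch, len_y, emb_dim)
        inter = torch.bmm(x, y.transpose(1, 2))    # (batch, len_x, len_y) word-word interactions
        feats = torch.relu(self.conv(inter.unsqueeze(1)))  # add a channel dimension
        feats = self.pool(feats).flatten(1)        # fixed-size matching features
        return self.out(feats).squeeze(-1)         # matching score
```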

Different types of deep learning text matching models are summarized below. As can be seen, there is already a large body of work on deep text matching. This article introduces part of the recent work in detail; for the rest, please refer to the corresponding literature.

  • representation-based: DSSM [1]; CDSSM [2]; ARC-I [3]; CNTN [4]; LSTM-RNN [5]
  • representation-based extension: MultiGranCNN [6]; MV-LSTM [7]
  • interaction-based: ARC-II [8]; MatchPyramid [9]; Match-SRNN [10]; DeepMatch [11]; ABCNN [12]; QA-LSTM/CNN with attention [13, 14]; AP [15]; AICNN [16]; MVFNN [17]; BiMPM [18]; DQI [22]; DIIN [23]

2.2 Model Introduction

2.2.1 ABCNN [12]

Firstly, BCNN is introduced, which is the basis of ABCNN model, that is, the model without Attention. The model structure is shown in the figure:

Input layer: the input sentences are padded and converted into word vectors. Convolution layer: the sentence representations are convolved using wide convolution. Pooling layer: two pooling methods are used in the paper: the final pooling layer uses all-ap (average pooling over the whole sequence), while intermediate pooling layers use w-ap (window average pooling); the difference is the window size used for pooling. Output layer: a logistic regression layer performs binary classification.

ABCNN adds two attention mechanisms on top of BCNN. The model structure is shown as follows:

(1) Add attention to input layer

The idea is to expand the input into two channels; the new channel is the attention feature map, shown in blue. First, the attention matrix A is computed, where each element Aij is the match score between the i-th word of sentence 1 and the j-th word of sentence 2 (a Euclidean-distance-based score is used here). Then the attention feature maps of the two sentences are computed: two parameter matrices W0 and W1 are multiplied with A and with the transpose of A, respectively, producing feature maps of the same size as the original ones. W0 and W1 are model parameters and may share the same weights. In this way the original input is expanded into two channels.

(2) Add attention into the pooling layer

The attention matrix A is computed in the same way as above. After A is obtained, attention weight vectors are computed for the two sentences, namely the column-wise and row-wise sums indicated by the two dashed lines in the figure above. Each element of these vectors is the weight of the corresponding word in average pooling. In effect, pooling is no longer simple average pooling, but pooling weighted by the computed attention vectors. Both attention computations are sketched below.
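The sketch below illustrates both attention computations in numpy; the shapes of the parameter matrices W0 and W1 are illustrative, not the exact configuration of the paper:

```python
import numpy as np

def match_score(x, y):
    """Euclidean-distance-based match score, 1 / (1 + |x - y|)."""
    return 1.0 / (1.0 + np.linalg.norm(x - y))

def attention_matrix(s0, s1):
    """s0: (d, len_0), s1: (d, len_1) feature maps, one column per word."""
    A = np.zeros((s0.shape[1], s1.shape[1]))
    for i in range(s0.shape[1]):
        for j in range(s1.shape[1]):
            A[i, j] = match_score(s0[:, i], s1[:, j])
    return A

# (1) Input-layer attention: project A into attention feature maps that
#     become a second input channel for each sentence.
def attention_feature_maps(A, W0, W1):
    F0_attn = W0 @ A.T      # W0: (d, len_1) -> feature map of size (d, len_0)
    F1_attn = W1 @ A        # W1: (d, len_0) -> feature map of size (d, len_1)
    return F0_attn, F1_attn

# (2) Pooling-layer attention: row/column sums of A become pooling weights.
def attention_pooling_weights(A):
    w0 = A.sum(axis=1)      # weight of each word in sentence 0
    w1 = A.sum(axis=0)      # weight of each word in sentence 1
    return w0, w1
```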

2.2.2 QA-LSTM/CNN with Attention [13, 14]

Given a (q, a) pair, where q is the question and a is a candidate answer, a biLSTM encoder generates distributed representations of the question and the answer, and cosine similarity measures their distance. The training objective is a hinge loss.
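A minimal sketch of this setup (max pooling over the biLSTM states and an illustrative margin value; not the exact configuration of the papers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QALSTM(nn.Module):
    """biLSTM encoder for questions and answers, scored with cosine similarity."""
    def __init__(self, vocab_size, emb_dim=128, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def encode(self, ids):
        out, _ = self.lstm(self.emb(ids))   # (batch, seq_len, 2 * hidden)
        return out.max(dim=1).values        # max pooling over time

    def forward(self, q_ids, a_ids):
        return F.cosine_similarity(self.encode(q_ids), self.encode(a_ids), dim=-1)

def hinge_loss(model, q, a_pos, a_neg, margin=0.2):
    """Push the positive answer above a sampled negative by at least `margin`."""
    return torch.clamp(margin - model(q, a_pos) + model(q, a_neg), min=0).mean()
```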

On top of the biLSTM output representations, a CNN is further applied to capture local information among the biLSTM output vectors, giving richer representations of questions and answers.

When the biLSTM propagates dependencies over long distances in questions and answers, the fixed width of the hidden vectors becomes a bottleneck. An attention mechanism can mitigate this weakness by dynamically giving more weight to the more informative parts of the answer with respect to the question: before max/mean pooling, each biLSTM output vector of the answer is multiplied by a softmax weight derived from the biLSTM embedding of the question.

2.2.3 Attentive Pooling Networks [15]

In QA-LSTM with attention, the attention weights capture the influence of the question on the answer but ignore the influence of the answer on the question. Attentive Pooling Networks apply attention to both questions and answers to improve accuracy. While jointly learning the representations of the two inputs and measuring their similarity, the innovation is to project the two inputs Q and A into a common representation space through a parameter matrix U and to build a matrix G from the representations of Q and A. Max pooling over the rows and columns of G then yields the attention vectors of Q and A, respectively. The framework of the AP_BILSTM model is as follows:

In the AP_BILSTM design, the features of the question and the answer are extracted with a biLSTM, and a soft alignment is computed from the two sets of features. The resulting matrix G represents the interaction between question and answer. Max pooling over the columns of the matrix gives the importance score of the answer with respect to the question, and similarly, max pooling over the rows gives the importance score of the question with respect to the answer. These two vectors are then used as attention vectors to weight the question and answer representations before matching them.
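The attentive pooling step can be written compactly; the numpy sketch below assumes Q and A are the biLSTM output matrices with one column per token and U is the learned projection matrix:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pooling(Q, A, U):
    """Q: (d, m) question states, A: (d, n) answer states, U: (d, d) parameter.
    Returns fixed-length question and answer vectors after attentive pooling."""
    G = np.tanh(Q.T @ U @ A)           # (m, n) soft alignment between Q and A
    sigma_q = softmax(G.max(axis=1))   # attention over question positions
    sigma_a = softmax(G.max(axis=0))   # attention over answer positions
    r_q = Q @ sigma_q                  # (d,) attention-weighted question vector
    r_a = A @ sigma_a                  # (d,) attention-weighted answer vector
    return r_q, r_a                    # compared afterwards, e.g. by cosine similarity
```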

2.2.4 AICNN [16]

Previous studies on answer selection generally ignore the redundancy and noise that are prevalent in real data. This paper designs a novel attentive interactive neural network (AI-NN) to focus on the text segments that actually contribute to answer selection. The representations of the question and the answers are first learned by a convolutional neural network (CNN) or another architecture. AI-NN then learns the interaction of each pair of segments of the two texts, uses row-wise and column-wise pooling to collect the interaction information, and applies an attention mechanism to measure the importance of each segment, combining the interactions to obtain fixed-length representations of the question and the answer. The model framework is as follows:

2.2.5 MVFNN [17]

The neural-network-based approaches above consider several different aspects of information by computing attention. These different kinds of attention are usually combined in a simple way and can be viewed as a "single view"; they fail to examine questions and candidate answers from multiple perspectives, leading to serious information loss. To overcome this problem, this model proposes a multi-view fusion neural network in which each attention component generates a different "view" of the QA pair, and these views are fused with the feature representations of the QA pair itself to form a more holistic representation. The model framework is as follows:

For a given question, there may be several views that model the corresponding answer. This model intuitively builds four views, named the query type view, query active-word view, query semantic view, and co-attention view. Finally, a fusion RNN is used to fuse these views. By fusing the different views, the two objects can be modeled more accurately.

2.2.6 BiMPM [18]

Interaction-based methods generally first match the units of the two sentences against each other and then aggregate the results into a vector for final matching. This captures the interaction between the two sentences, but earlier work matches only at the word level and ignores information at other granularities; in addition, matching is performed in only one direction, ignoring the reverse direction. Bilateral Multi-Perspective Matching (BiMPM), a bidirectional multi-perspective matching model, addresses these problems. The model framework is shown below:

The model consists of five layers from bottom to top, namely, word representation layer, context representation layer, matching layer, aggregation layer and prediction layer. The matching layer is the core of the model, and four matching strategies are proposed. The matching here can be regarded as the attention mechanism.

Word representation layer: word vectors are trained with GloVe, and character embeddings are randomly initialized; the character-composed word representation is produced by feeding the characters into an LSTM.

Context representation layer: P and Q are encoded with a BiLSTM.

Matching layer: the core layer of the model, containing four matching strategies: Full-Matching, Maxpooling-Matching, Attentive-Matching, and Max-Attentive-Matching, as illustrated in the figure below.
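All four strategies build on the same multi-perspective cosine matching function between two vectors; a minimal numpy sketch of it follows, with the number of perspectives l and the weight matrix W as illustrative parameters:

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def multi_perspective_match(v1, v2, W):
    """v1, v2: (d,) context vectors to compare; W: (l, d) perspective weights.
    Perspective k reweights the dimensions before the cosine similarity,
    producing an l-dimensional matching vector instead of a single score."""
    return np.array([cosine(W[k] * v1, W[k] * v2) for k in range(W.shape[0])])
```

Full-Matching, for example, applies this function between every time step of one sentence and the final time step of the other.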

Aggregation layer: a BiLSTM processes the output vectors of the matching layer; the last-time-step outputs of the forward and backward directions for P and Q are concatenated and fed to the prediction layer.

Prediction layer: a softmax layer performs the final classification.

The above summarizes some deep text matching models from recent years. Next, an FAQBot based on deep models is introduced.

3. FAQBot based on deep learning

3.1 Modeling process

3.2 Data acquisition and construction

3.2.1 Data Acquisition

In scenarios with a large volume of questions and answers, such as intelligent customer service, there are many high-frequency knowledge points (question-answer pairs). Each high-frequency knowledge point usually corresponds to several different ways of asking the question; that is, the knowledge base is structured as sets of questions that share the same answer. FAQ data consists of three types:

  1. Standard question Q: the standard user query for a knowledge point in the FAQ
  2. Answer A: the standard answer corresponding to a standard question in the FAQ
  3. Similar questions q1, q2, …: queries that are semantically similar to a standard question and can be answered with the same answer

A standard question Q, its corresponding answer A, and all the similar questions q1, q2, … corresponding to Q together form a knowledge point. An example of a knowledge point is shown below:

3.2.2 Data construction

Data construction has two aspects:

(1) Construction of training set and test set

Test set: the first similar question q1 of each knowledge point is used as the query, and Lucene recalls 30 knowledge points from the whole FAQ knowledge base as the candidate set.

Training set: contains two parts, the construction of positive examples and the construction of negative examples, and the way these two parts are built directly affects the final effect. Since the first similar question of each knowledge point is used in the test set, q1 is excluded when building the training set. Knowledge points that still have two or more similar questions left can then be used for training: positive and negative examples are constructed in different ways from the standard questions of these knowledge points and from the remaining similar questions (i.e., [q2, q3, …, qn]).

Construction of positive examples in the training set: remove the first similar question q1 from every knowledge point and pair the remaining similar questions with each other as positive pairs; knowledge points with too many similar questions are truncated.

Negative examples for the training set are constructed in the following ways (a minimal sketch of the pair construction follows the list):

  • Recall according to Jaccard distance;
  • Recall by Lucene;
  • Random selection from other knowledge points;
  • Sample from other knowledge points in proportion to each question's frequency in the positive examples;
  • Pair each sentence with its nouns/verbs;
  • To mitigate the imbalanced distribution of knowledge points, knowledge points with too many similar questions are truncated.
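As a rough sketch of this pair construction (assuming the knowledge base is a list of dicts with a 'standard' question and a 'similar' list; only the Jaccard-based negative sampling from other knowledge points is shown):

```python
from itertools import combinations

def jaccard(q1, q2):
    w1, w2 = set(q1.split()), set(q2.split())
    return len(w1 & w2) / max(len(w1 | w2), 1)

def build_pairs(knowledge_base, neg_per_anchor=2):
    """knowledge_base: list of {'standard': str, 'similar': [q1, q2, ...]}.
    q1 of every knowledge point is held out for the test set."""
    positives, negatives = [], []
    for idx, kp in enumerate(knowledge_base):
        rest = kp["similar"][1:]                        # drop q1 (test set)
        for a, b in combinations([kp["standard"]] + rest, 2):
            positives.append((a, b, 1))
        # negatives: lexically close questions from *other* knowledge points
        others = [q for j, other_kp in enumerate(knowledge_base) if j != idx
                  for q in other_kp["similar"]]
        for anchor in rest[:1]:                         # one anchor per knowledge point
            hard = sorted(others, key=lambda o: jaccard(anchor, o), reverse=True)
            for o in hard[:neg_per_anchor]:
                negatives.append((anchor, o, 0))
    return positives + negatives
```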

(2) Data augmentation strategy

Since deep learning requires a lot of data, the following strategies are adopted to augment the data (a minimal sketch follows the list):

  • Switch the order between two sentences;
  • Segment sentences and recombine the segments to generate new sentences;
  • Randomly shuffle and sample sentences.
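A minimal sketch of these augmentations (whitespace splitting stands in for a proper Chinese tokenizer such as jieba):

```python
import random

def augment(sent_a, sent_b):
    """Yield extra (a, b) training pairs derived from one positive pair."""
    # 1) swap the order of the two sentences
    yield sent_b, sent_a
    # 2) segment and recombine: swap the two halves of one sentence
    words = sent_a.split()
    if len(words) > 3:
        mid = len(words) // 2
        yield " ".join(words[mid:] + words[:mid]), sent_b
    # 3) randomly shuffle the words of one sentence
    shuffled = words[:]
    random.shuffle(shuffled)
    yield " ".join(shuffled), sent_b
```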

3.3 Model Building

3.3.1 Model framework

The basic framework uses two encoders to obtain the context information of the two sentences to be matched, and then matches the two sentences' context information to obtain matching features. Other traditional text features can also be concatenated with the matching features. Finally, a softmax layer performs the classification. The model framework is shown in the figure below:

3.3.2 Model building and iterative optimization

Embedding: word vectors and character vectors are trained with word2vec and FastText.

Encoder layer: convolution extracts local features, so a CNN can capture n-gram-like key information in a sentence while taking context into account. Therefore textCNN [19] is adopted to encode the sentences. The encoding process is shown in the figure below:
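A minimal sketch of a textCNN encoder (filter widths and counts are illustrative, not the tuned configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNEncoder(nn.Module):
    """Encode a sentence into a fixed-length vector with several filter widths."""
    def __init__(self, vocab_size, emb_dim=128, num_filters=100, widths=(2, 3, 4)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=w) for w in widths])

    def forward(self, ids):                       # ids: (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)         # (batch, emb_dim, seq_len)
        # each conv yields (batch, num_filters, seq_len - w + 1); max-pool over time
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)            # (batch, num_filters * len(widths))
```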

Matching layer: once the representations of the two sentences are obtained, they must be matched. Various matching operations can be constructed as needed, as shown in the figure from [20]. We use relatively simple element-wise addition and multiplication for matching.
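In code, this matching step is just an element-wise combination of the two encoder outputs, for example:

```python
import torch

def match_features(u, v):
    """Element-wise addition and multiplication of two sentence vectors,
    concatenated into one matching feature vector."""
    return torch.cat([u + v, u * v], dim=-1)
```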

Join layer: after the joint representation of the two sentences is obtained from the matching layer, additional traditional features are concatenated to it, similar to the figure below from [21].

The steps above do not consider the association between the two sentences while encoding them. Therefore, more detailed and local interaction information is further introduced to capture the interaction between the two sentences, and new representations of the two sentences are derived from the interaction matrix, as shown in the figure:

Introducing the attention mechanism: weight vectors are used to measure how much the different parts of a sentence matter. The attention computation mainly follows the kinds of attention used in AICNN and ABCNN, namely feature-level attention, and attention between the new post-interaction representation and the original sentence representation.

4. Summary and Outlook

4.1 Data Layer

  • Build a more reasonable knowledge base: each knowledge point contains only one intent, with no overlap, ambiguity, redundancy, or other sources of confusion between knowledge points
  • Annotation: accumulate a number of representative similar questions for each FAQ
  • Continuous maintenance: including discovering new FAQs and merging, splitting, and correcting existing FAQs

4.2 Model Level

  • Further capture syntax-level and semantics-level knowledge, such as semantic role labelling (SRL) and part-of-speech (POS) tagging, and introduce it into the text representation to improve the effect of semantic text matching
  • Most current work in retrieval-based question answering performs question-question matching or question-answer matching. In future work, information from both questions and answers can be modeled jointly, as shown in the figure:

References

[1] Huang P S, He X, Gao J, et al. Learning deep structured semantic models for web search using clickthrough data[C]// ACM International Conference on Conference on Information & Knowledge Management. ACM, 2013:2333-2338.

[2] Shen Y, He X, Gao J, et al. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval[C]// Acm International Conference on Conference on Information & Knowledge Management. ACM, 2014:101-110.

[3] Hu B, Lu Z, Li H, et al. Convolutional Neural Network Architectures for Matching Natural Language Sentences[J]. Advances in Neural Information Processing Systems, 2015, 3:2042-2050.

[4] Qiu X, Huang X. Convolutional neural tensor network architecture for community-based question answering[C]// International Conference on Artificial Intelligence. AAAI Press, 2015:1305-1311.

[5] Palangi H, Deng L, Shen Y, et al. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval[J]. IEEE/ACM Transactions on Audio Speech & Language Processing, 2016, 24 (4) : 694-707.

[6] Yin W, Schütze H. MultiGranCNN: An Architecture for General Matching of Text Chunks on Multiple Levels of Granularity[C]// Meeting of the Association for Computational Linguistics and the, International Joint Conference on Natural Language Processing. 2015:63-73.

[7] Wan S, Lan Y, Guo J, et al. A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations[J]. 2015:2835-2841.

[8] Hu B, Lu Z, Li H, et al. Convolutional Neural Network Architectures for Matching Natural Language Sentences[J]. Advances in Neural Information Processing Systems, 2015, 3:2042-2050.

[9] Pang L, Lan Y, Guo J, et al. Text Matching as Image Recognition[J]. 2016.

[10] Wan S, Lan Y, Xu J, et al. Match-SRNN: Modeling the Recursive Matching Structure with Spatial RNN[J]. Computers & Graphics, 2016, 28(5):731-745.

[11] Lu Z, Li H. A deep architecture for matching short texts[C]// International Conference on Neural Information Processing Systems. Curran Associates Inc. 2013:1367-1375.

[12] Yin W, Schütze H, Xiang B, et al. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs[J]. Computer Science, 2015.

[13] Tan M, Santos C D, Xiang B, et al. LSTM-based Deep Learning Models for Non-factoid Answer Selection[J]. Computer Science, 2015.

[14] Tan M, Santos C D, Xiang B, et al. Improved Representation Learning for Question Answer Matching[C]// Meeting of the Association for Computational Linguistics. 2016:464-473.

[15] Santos C D, Tan M, Xiang B, et al. Attentive Pooling Networks[J]. 2016.

[16] X Zhang, S Li, L Sha, H Wang. Attentive Interactive Neural Networks for Answer Selection in Community Question Answering[C]// International Conference on Artificial Intelligence.

[17] L Sha, X Zhang, F Qian, B Chang, Z Sui. A Multi-View Fusion Neural Network for Answer Selection[C]// International Conference on Artificial Intelligence.

[18] Wang Z, Hamza W, Florian R. Bilateral Multi-Perspective Matching for Natural Language Sentences[C]// Twenty-Sixth International Joint Conference on Artificial Intelligence. 2017:4144-4150.

[19] Kim Y. Convolutional Neural Networks for Sentence Classification[J]. Eprint Arxiv, 2014.

[20] Wang S, Jiang J. A Compare-Aggregate Model for Matching Text Sequences[J]. 2016.

[21] Severyn A, Moschitti A. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks[C]// The International ACM SIGIR Conference. ACM, 2015:373-382.

[22] Xiaodong Zhang, Xu Sun, Houfeng Wang. Duplicate Question Identification by Integrating FrameNet with Neural Networks[C]//In the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

[23] Gong Y, Luo H, Zhang J. Natural Language Inference over Interaction Space[J]. 2018.
