Machine Learning Algorithms and Natural Language Processing
Author | Liu Cong NLP, original columnist of this public account
Occupation | NLP algorithm engineer
Zhihu column | Natural language processing related papers
Short text similarity computation solves for the degree of similarity between two short texts. It is a special form of the text matching or textual entailment task that returns a numerical score for how similar the two texts are. In industry, short text similarity calculation plays an important role.
For example, in a question answering system (Q&A bot), we often manually configure some common, clearly phrased questions together with their answers; these configured questions are called "standard questions". When a user asks a question, the similarity between the user's question and every configured standard question is computed, the standard question most similar to the user's question is found, and its answer is returned to the user. This completes one round of Q&A.
At present, short text similarity algorithms fall into three categories: (1) unsupervised similarity calculation; (2) supervised similarity calculation; (3) supervised + unsupervised similarity calculation.
I. Unsupervised similarity calculation
First, word vectors are trained with word2vec on a large-scale corpus. Then the short text is segmented into words and the word vector corresponding to each word is looked up. Finally, the sentence vector of the short text is obtained by summing all the word vectors (or by weighted summation according to part of speech or hand-written rules), and the similarity of two short texts is obtained by measuring the distance between their sentence vectors.
The distance measures include: (1) Euclidean distance; (2) cosine distance; (3) Manhattan distance; (4) Chebyshev distance; (5) Minkowski distance; (6) Mahalanobis distance; (7) standardized Euclidean distance; (8) Hamming distance; (9) Jaccard distance; (10) correlation distance.
Examples of the Euclidean distance and cosine distance formulas:
(1) Euclidean distance (also called the Euclidean metric) is the most common distance measure; it measures the absolute distance between two points in a multi-dimensional space.
Euclidean distance between two points (x_1, y_1) and (x_2, y_2) in the two-dimensional plane:
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
Euclidean distance between two points a = (a_1, \dots, a_n) and b = (b_1, \dots, b_n) in an n-dimensional vector space:
d = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
The smaller the Euclidean distance, the closer the two vectors are to each other, that is, the more similar they are.
(2) Cosine distance uses the cosine of the angle between two vectors to measure the difference between their directions.
Cosine of the angle between two vectors (x_1, y_1) and (x_2, y_2) in the two-dimensional plane:
\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}
Cosine of the angle between two vectors a and b in an n-dimensional vector space:
\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
The cosine of the included angle ranges over [-1, 1]. The larger the cosine, the smaller the angle between the two vectors and the more similar they are.
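As a minimal sketch of the unsupervised pipeline described above (the word vectors below are toy stand-ins for vectors trained with word2vec, and the helper functions are illustrative, not from any particular library):

```python
import numpy as np

# Toy stand-in for a pretrained word2vec lookup table (word -> vector).
# In practice these vectors would come from a model trained on a large corpus.
word_vectors = {
    "how":      np.array([0.1, 0.3, 0.5]),
    "reset":    np.array([0.7, 0.2, 0.1]),
    "my":       np.array([0.2, 0.1, 0.4]),
    "password": np.array([0.9, 0.4, 0.3]),
    "change":   np.array([0.6, 0.3, 0.2]),
}

def sentence_vector(tokens, vectors):
    """Unweighted average of the word vectors of the tokens found in the table."""
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0) if found else np.zeros(3)

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector(["how", "reset", "my", "password"], word_vectors)
s2 = sentence_vector(["how", "change", "my", "password"], word_vectors)
print("euclidean:", euclidean_distance(s1, s2))
print("cosine:", cosine_similarity(s1, s2))
```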
Although unsupervised methods can compute similarity quickly and require no annotated corpus, the quality of the sentence vector often depends on hand-tuned word weights, which are easily affected by subjective human factors, and the results are often unsatisfactory.
II. Supervised similarity calculation
Supervised similarity calculation is carried out with an annotated corpus: a deep learning model is built on the labeled data, and the similarity of two short texts is obtained directly through end-to-end learning. Current approaches fall into three framework types: (1) the "Siamese" framework; (2) the "interaction-aggregation" framework; (3) the "pre-training" framework.
1. The "Siamese" framework: the two short texts are each fed into the same deep learning encoder (such as a CNN or RNN), so that the two sentences are mapped into the same vector space. The distance between the two sentence vectors is then measured to obtain the similarity of the short texts. The representative model is the Siamese network [1], as shown in Figure 1.
Figure 1: Structure of the Siamese network model
The advantage is that sharing parameters makes the model smaller and easier to train, and the sentence vectors produced by the model carry some semantic information. The disadvantage, however, cannot be ignored: during the mapping there is no explicit interaction between the two texts, so a lot of mutually relevant information is lost. A minimal sketch of this structure follows.
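The sketch below shows a Siamese encoder in PyTorch, assuming an LSTM encoder with hypothetical vocabulary and dimension sizes; it only illustrates the shared-parameter structure, not the exact published model [1]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared encoder: both sentences pass through the same embedding + LSTM."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)     # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)                 # sentence vector: (batch, hidden_dim)

encoder = SiameseEncoder()

# Two batches of (already tokenized and padded) sentences, as integer ids.
sent_a = torch.randint(1, 10000, (4, 12))
sent_b = torch.randint(1, 10000, (4, 12))

vec_a = encoder(sent_a)   # the same weights are used for both inputs
vec_b = encoder(sent_b)

# Distance measurement between the two sentence vectors, e.g. cosine similarity.
similarity = F.cosine_similarity(vec_a, vec_b, dim=-1)
print(similarity)
```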
2. The "interaction-aggregation" framework: this framework was proposed to address the shortcoming of the first one. Its first part is the same as the Siamese framework: the two texts are fed into the same deep learning encoder to obtain two sentence representations. But instead of directly measuring the distance between the two sentence vectors, the two representations exchange information through one or more attention mechanisms and are aggregated into a single vector, which is then mapped to a similarity value (through a fully connected layer with a single output node). Representative models include ESIM [2], BiMPM [3], DIIN [4] and CAFE [5], as shown in Figure 2.
Figure 2: Structure of the BiMPM model
This framework captures more of the interaction between two sentences, so it is a significant improvement over the first framework.
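Below is a rough sketch of the kind of soft-alignment attention used for the interaction step, in PyTorch; the tensor shapes, the mean-pooling aggregation and the final linear layer are illustrative simplifications, not the exact ESIM or BiMPM architecture:

```python
import torch
import torch.nn.functional as F

# Token-level encodings of the two sentences produced by a shared encoder
# (shapes are illustrative: batch=2, lengths 7 and 9, hidden size 128).
enc_a = torch.randn(2, 7, 128)
enc_b = torch.randn(2, 9, 128)

# Alignment scores between every token pair of the two sentences.
scores = torch.bmm(enc_a, enc_b.transpose(1, 2))      # (batch, len_a, len_b)

# Each token of A is re-represented by an attention-weighted mix of B, and vice versa.
a_aligned = torch.bmm(F.softmax(scores, dim=2), enc_b)                  # (batch, len_a, 128)
b_aligned = torch.bmm(F.softmax(scores.transpose(1, 2), dim=2), enc_a)  # (batch, len_b, 128)

# Aggregate the interaction-enhanced representations into one vector per pair,
# then map it to a single similarity score with a fully connected layer.
pooled = torch.cat([
    torch.cat([enc_a, a_aligned], dim=-1).mean(dim=1),
    torch.cat([enc_b, b_aligned], dim=-1).mean(dim=1),
], dim=-1)                                              # (batch, 512)
score = torch.sigmoid(torch.nn.Linear(512, 1)(pooled))  # (batch, 1)
print(score)
```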
3. The "pre-training" framework: this framework comes from the recently popular pre-trained models (representative models: ELMo [6], GPT [7], BERT [8], etc.). It adopts a two-stage approach: in the first stage, a language model is pre-trained on a large general-purpose corpus; in the second stage, the pre-trained language model is fine-tuned on the similarity task, that is, the two texts are fed jointly into the pre-trained model, a vector that already contains their interaction information is obtained, and it is mapped to a similarity value (through a fully connected layer with a single output node), as shown in Figure 3.
Figure 3: Structure of the BERT model
This framework usually has a very large number of parameters, and thanks to the large general-purpose corpus it is pre-trained on, it is more universal and can capture more subtle interaction features between the two short texts, so it achieves a further improvement over the second framework.
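A sketch of this pattern using the Hugging Face transformers library is given below; the checkpoint name is a placeholder, and the pair classification head is randomly initialized until the model is fine-tuned on labeled sentence pairs:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "bert-base-uncased" is a placeholder checkpoint; in practice the model would
# first be fine-tuned on labeled similar/dissimilar sentence pairs -- the
# classification head below is randomly initialized until then.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Both texts are fed into the pre-trained model together, so the transformer
# layers themselves perform the interaction between the two sentences.
inputs = tokenizer("how do I reset my password",
                   "how can I change my password",
                   return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits          # (1, 2)

similar_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(similar_prob)
```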
Although supervised methods compute short text similarity more accurately than unsupervised methods, the supervised model has a high computational cost. For example, to find the standard question most similar to a user question in the standard question library, every time a new user question arrives it must be encoded jointly with every standard question. If the standard question library is large, this takes a long time, and in a real application scenario users cannot be made to wait that long. This is also why, even though supervised models perform well, real question answering bots often still use the unsupervised approach.
III. Supervised + unsupervised similarity calculation
Considering the advantages and disadvantages of unsupervised and supervised learning, we can combine them to improve the accuracy of unsupervised methods while reducing the time cost of supervised methods.
So how do you do that?
(1) The weakness of unsupervised learning lies in generating the sentence vector of a text: a hand-weighted sum of word vectors does not give a good sentence vector, and the resulting vector contains no contextual semantic information.
We can instead use supervised learning to obtain the sentence vector of a short text. As mentioned above, sentence vectors of short texts can be obtained through the Siamese network. Although the two sentence vectors carry no direct interaction information, each contains the contextual semantic information of its own text.
(2) The time complexity of supervised learning is too high. To avoid recomputing against every text each time a new text arrives, we abandon the interaction between texts and focus directly on generating good sentence vectors.
At the startup stage of the question answering system or Q&A bot, the sentence vectors of all questions in the standard question library can be computed and stored. When a user question arrives, we only need to compute the sentence vector of the user question, measure its distance to the stored standard question vectors, and return the most similar standard question. This saves a large part of the time cost.
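A small sketch of this offline/online split follows; `encode` here is a hypothetical placeholder for any trained sentence encoder (such as the Siamese encoder sketched earlier):

```python
import numpy as np

# `encode` stands in for a trained sentence encoder; here it is a hypothetical
# function that deterministically maps a text to a 128-dimensional vector.
def encode(text):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

standard_questions = [
    "how do I reset my password",
    "how do I cancel my order",
    "where is my invoice",
]

# Offline (startup) stage: encode and store every standard question once.
question_matrix = np.stack([encode(q) for q in standard_questions])
question_matrix /= np.linalg.norm(question_matrix, axis=1, keepdims=True)

# Online stage: only the user question needs to be encoded per request.
user_vec = encode("i forgot my password, how can i reset it")
user_vec /= np.linalg.norm(user_vec)

scores = question_matrix @ user_vec          # cosine similarities
best = int(np.argmax(scores))
print(standard_questions[best], scores[best])
```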
(3) Supervised learning obtains better sentence vectors than unsupervised learning, so how can the quality of the sentence vectors be improved further? Before 2018, people used CNNs or LSTMs to encode short texts into sentence vectors, but the results were mediocre and the improvement over unsupervised learning was limited; at the same time, an annotated corpus was required, so the benefit in industry was not obvious. In 2018, with the advent of BERT, obtaining sentence vectors was raised to a new level: with the help of a huge pre-training corpus and a large number of model parameters, the BERT model topped all the leaderboards. We can replace the CNN or LSTM structure in the original Siamese network with a BERT model to obtain sentence vectors carrying richer semantic information, making it practical to use supervised models in industry. The representative model is Sentence-BERT [9].
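As an illustration, sentence vectors in the Sentence-BERT style can be obtained with the sentence-transformers library; the checkpoint name below is a placeholder, and the API calls reflect recent versions of that library:

```python
from sentence_transformers import SentenceTransformer, util

# The checkpoint name is a placeholder; any Sentence-BERT-style model trained
# on sentence-pair data can be substituted.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

standard_questions = [
    "how do I reset my password",
    "how do I cancel my order",
]
# The sentence vectors come from a BERT-based Siamese encoder, so they can be
# precomputed and cached exactly as in the previous sketch.
question_embeddings = model.encode(standard_questions, convert_to_tensor=True)

user_embedding = model.encode("i forgot my password", convert_to_tensor=True)
scores = util.cos_sim(user_embedding, question_embeddings)   # (1, 2)
print(scores)
```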
Because the BERT model is huge, we can also use model distillation (which we can cover another time if there is a chance) to further reduce the time cost.
Conclusion
All the short text similarity calculation methods introduced in this article are based on word vectors. There are also some traditional methods, such as TF-IDF and topic models; interested readers can study them on their own.
References:
[1]Learning a similarity metric discriminatively, with application to face verification
[2]Enhanced LSTM for Natural Language Inference
[3]Bilateral Multi-Perspective Matching for Natural Language Sentences
[4]Natural Language Inference over Interaction Space
[5]Compare, Compress and Propagate: Enhancing Neural Architectures with Alignment Factorization for Natural Language Inference
[6]Deep contextualized word representations
[7]Improving Language Understanding by Generative Pre-Training
[8]BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[9]Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks