Text similarity calculation is one of the most active research directions in NLP and is widely used in search, recommendation, intelligent customer service, chatbots, and other fields. The exact formulation differs by application: in search, similarity is usually computed between a Query and a Document, whereas intelligent customer service and chat focus on Query-to-Query matching, i.e. similarity calculation between short texts.

Similarity calculation schemes also differ with text length. Long-text matching focuses on the keywords or topics in the text, and the algorithms commonly used in industry include TF-IDF, LSA, and LDA. Short-text matching focuses on the semantic consistency of the whole sentence, and the mainstream approaches are deep models such as Word2vec, ESIM, ABCNN, and BERT.

Compared with long text, short-text similarity calculation is more challenging. First, a short text carries limited context, so its semantics are often not fully expressed. Second, short texts tend to be more colloquial and are more likely to omit information. Third, short-text matching depends on the overall semantics of the sentence, which makes it more sensitive to word order and sentence structure.

Some example query pairs illustrate these difficulties: the two sentences may share many words yet differ in meaning, or differ in surface form yet mean the same thing.

query1 | query2
I’m gonna call you | I want to hit you
What’s your name | What did you call me
My name is Bush | My name is not Boo
Do you have a boyfriend | Are you single
You are so funny | You’re a funny guy
I like watching anime | Don’t you know I like watching anime

Different similarity algorithms produce scores on different scales, so raw scores cannot be compared directly across algorithms. Instead, a score threshold is set for each algorithm: if the score exceeds the threshold, the two texts are judged semantically equivalent; otherwise they are judged different. On a dataset with gold labels, the quality of similarity calculation can then be measured by accuracy. Commonly used Chinese evaluation datasets include LCQMC, BQ Corpus, PAWS-X (Chinese), and AFQMC.
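A minimal sketch of this evaluation (the threshold is tuned per algorithm, e.g. on a held-out set; the scores and labels below are made up):

```python
import numpy as np

def accuracy_at_threshold(scores, labels, threshold):
    """Binarize similarity scores at a per-algorithm threshold and measure accuracy."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return float((preds == np.asarray(labels)).mean())

# Hypothetical example: similarity scores from some model vs. gold labels (1 = same meaning).
scores = [0.91, 0.42, 0.77, 0.13]
labels = [1, 0, 1, 0]
print(accuracy_at_threshold(scores, labels, threshold=0.7))   # -> 1.0
```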

1. Mainstream approaches

Short-text similarity schemes commonly used in industry fall roughly into two categories: supervised and unsupervised learning. Supervised learning generally performs better, but when there is not enough training data and the system needs a cold start, unsupervised learning is usually preferred for going online.

1.1 Unsupervised learning

The simplest and most effective unsupervised scheme is pre-training: a model such as Word2vec or BERT is pre-trained on unlabeled data from the task domain, and the resulting model is then used to obtain semantic representations of words and sentences for similarity calculation.

Word2vec was a milestone in NLP. It turned word representations from discrete one-hot vectors into continuous embeddings, which not only reduces the dimensionality of computation but also brought a qualitative leap on many tasks. By training a language-modeling objective on large corpora, Word2vec makes semantically similar words highly correlated in the embedding space.

A sentence embedding can be obtained by averaging (CBOW-style mean pooling) or max-pooling the embeddings of the words in the sentence, so that sentences with similar semantics are highly correlated in the embedding space. Compared with traditional TF-IDF similarity, this generalizes better. However, because mean pooling gives every word the same weight, the keywords of the sentence cannot be emphasized, which limits the accuracy of semantic computation and makes it hard to meet online quality requirements.
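A minimal sketch with toy vectors (a real system would load Word2vec embeddings trained on in-domain data):

```python
import numpy as np

# Toy word vectors; in practice these come from a Word2vec model trained on in-domain text.
word_vecs = {
    "i": np.array([0.2, 0.1, 0.7]),
    "like": np.array([0.9, 0.3, 0.1]),
    "love": np.array([0.8, 0.4, 0.2]),
    "watching": np.array([0.3, 0.7, 0.2]),
    "anime": np.array([0.1, 0.8, 0.3]),
}

def sentence_embedding(tokens):
    """CBOW-style mean pooling: every word gets the same weight."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_embedding("i like watching anime".split())
s2 = sentence_embedding("i love anime".split())
print(cosine(s1, s2))   # high score for semantically close sentences
```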



Although Word2vec provides some generalization, its biggest weakness is that a word has exactly the same representation in every context, so it cannot capture the rich variability of language. Large-scale pre-trained models such as GPT and BERT solve this problem by making word representations context-dependent, and they have kept refreshing the leaderboards of tasks in many fields.

However, it has been shown that computing a sentence embedding directly from BERT outputs, either by averaging all token embeddings or by taking the [CLS] token embedding, performs poorly for semantic similarity, sometimes even worse than GloVe. The reason is that during BERT pre-training, high-frequency words co-occur more often and the MLM objective pushes their representations closer together, while low-frequency words are distributed more sparsely. The resulting uneven semantic space leaves many semantic “holes” around low-frequency words, and these holes bias the similarity scores computed in that space.
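For concreteness, the naive pooling criticized above looks like this with HuggingFace Transformers (bert-base-chinese is just an illustrative checkpoint):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

def naive_sentence_embedding(sentences):
    """Mask-aware mean pooling of the last-layer token embeddings.
    Using out[:, 0] instead would be the [CLS] variant."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch).last_hidden_state             # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # ignore padding positions
    return (out * mask).sum(1) / mask.sum(1)
```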

To address the non-uniformity of BERT's semantic space, CMU and ByteDance proposed BERT-flow, which maps the BERT semantic space into a standard Gaussian latent space. Since the standard Gaussian distribution is isotropic, there are no “holes” in that space and the continuity of the semantic space is preserved.

BERT-flow trains an invertible mapping f that maps a variable z following a standard Gaussian distribution to the BERT-encoded representation u, so that conversely u can be mapped back into the standard Gaussian latent space. The mapping is learned by maximizing the likelihood of generating the BERT representations from the Gaussian prior:
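Writing $f_\varphi$ for the invertible flow and $p_Z$ for the standard Gaussian density, the objective takes the usual change-of-variables maximum-likelihood form (in BERT-flow only the flow parameters $\varphi$ are trained, while the BERT encoder stays fixed):

$$\max_{\varphi}\;\mathbb{E}_{u=\mathrm{BERT}(\text{sentence})}\left[\log p_Z\big(f_\varphi^{-1}(u)\big)+\log\left|\det\frac{\partial f_\varphi^{-1}(u)}{\partial u}\right|\right]$$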

Experiments show that semantic representation and similarity calculation with BERT-flow are far better than with Word2vec or with BERT embeddings used directly.

1.2 Supervised learning

BERT-flow greatly advanced unsupervised text similarity calculation, but on specific tasks there is still a gap compared with supervised learning. Supervised similarity models fall roughly into two categories: semantic representation models and semantic interaction models. Representation models are commonly used for large-scale query recall, while interaction models are used more in the semantic ranking stage.

DSSM is one of the most widely used semantic representation models in search. For short-text matching, Siamese networks are the most common structure, including Siamese CBOW, Siamese CNN, Siamese LSTM, and so on. In training, all queries share the same encoder, and similarity between queries is computed with cosine similarity; the objective is to maximize the similarity between positive pairs while suppressing the similarity between negative pairs. At prediction time, each Query is encoded independently into a semantic vector, and similarity scores are computed between these vectors. Because a Query's representation depends only on the Query itself, a semantic index over the corpus can be built in advance, which greatly improves retrieval efficiency.
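A minimal PyTorch sketch of the Siamese setup (the architecture and dimensions are illustrative, not our production model): the same encoder weights are applied to every query, and queries are compared by cosine similarity on their normalized vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseCNNEncoder(nn.Module):
    """Shared encoder applied to both queries (weights are shared by construction)."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, hid_dim, kernel_size=3, padding=1)

    def forward(self, token_ids):                         # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)           # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values    # max-pool over time
        return F.normalize(x, dim=-1)                     # unit vectors: dot product = cosine

def similarity(encoder, q1_ids, q2_ids):
    """Each query is encoded independently, so corpus vectors can be pre-computed and indexed."""
    return (encoder(q1_ids) * encoder(q2_ids)).sum(-1)    # cosine similarity
```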

Compared with representation models, interaction models match more accurately but have more complex structures. Commonly used interaction models include ABCNN, ESIM, etc. When computing similarity, an interaction model not only encodes each Query on its own but also models interaction features between the two Queries. It is usually trained as a binary classification task: the label is 1 if the two input queries are semantically equivalent and 0 otherwise. At prediction time, the output logits (or the softmax probability) can be used as a confidence score.
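For concreteness, a pre-trained BERT used as a cross-encoder is sketched below with HuggingFace Transformers (the checkpoint is illustrative, and the classification head still needs fine-tuning on labeled query pairs before its output is meaningful); ABCNN or ESIM would expose the same pair-in, score-out interface.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def match_confidence(query1, query2):
    """Encode the pair jointly and return P(semantically equivalent)."""
    batch = tok(query1, query2, truncation=True, return_tensors="pt")  # [CLS] q1 [SEP] q2 [SEP]
    with torch.no_grad():
        logits = model(**batch).logits                   # shape (1, 2)
    return torch.softmax(logits, dim=-1)[0, 1].item()    # probability of label "1"
```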



Large-scale pre-trained models have also swept the leaderboards of text similarity tasks. BERT pushed the SOTA on the LCQMC dataset to around 86%, and subsequent models such as RoBERTa, ALBERT, and ERNIE have kept raising the SOTA matching accuracy.

2. Business applications

Semantic question answering usually adopts a recall + ranking architecture, and we use a similar architecture in our chit-chat business: a Siamese CNN representation model for semantic recall, and a distilled Transformer interaction model for ranking.

When constructing the loss of the representation model, we borrowed the loss designs used in face recognition. The two tasks are similar in nature: face recognition represents face images as vectors, while text retrieval represents text as vectors, and both require high similarity between positive samples and good separation from negative samples.

When training the Siamese CNN, each training instance consists of 1 standard query, 1 positive sample, and 5 negative samples (we tried other numbers of negatives, but on our data 5 worked best). Training amounts to identifying which of these 6 candidates is the positive one, so it can be cast as a classification task in which each candidate corresponds to one class: the similarity between each candidate and the standard query is used as the logit for that class, the logits are normalized, and a classification loss is constructed. The decision boundary produced by plain Softmax only makes the classes separable; for better semantic representation we also want tighter intra-class cohesion and larger inter-class separation. Methods such as A-Softmax, AM-Softmax, and ArcFace map all queries onto a hypersphere, so the similarity between queries is determined by the angle between them: the smaller the angle, the higher the similarity. By adding a margin in the angular domain, they make classes more compact internally and better separated from each other, yielding better semantic representations.
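As a concrete sketch (assuming the positive candidate sits at index 0; the scale and margin values are illustrative, not taken from our setup), the 6-candidate setup becomes a cross-entropy loss over cosine logits; margin=0 corresponds to plain Softmax, and a positive additive margin is the AM-Softmax-style variant:

```python
import torch
import torch.nn.functional as F

def candidate_classification_loss(query_vec, cand_vecs, scale=30.0, margin=0.35):
    """Cross-entropy over 6 candidates (index 0 = positive, 1-5 = negatives).

    query_vec: (batch, dim)    encoded standard queries
    cand_vecs: (batch, 6, dim) encoded candidates
    """
    q = F.normalize(query_vec, dim=-1)
    c = F.normalize(cand_vecs, dim=-1)
    logits = torch.einsum("bd,bkd->bk", q, c)          # cosine similarity per candidate
    margin_vec = torch.zeros_like(logits)
    margin_vec[:, 0] = margin                          # shrink the positive logit -> tighter clusters
    target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(scale * (logits - margin_vec), target)
```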

We compared different normalization methods, namely Softmax, A-Softmax, AM-Softmax, and ArcFace: Softmax adds no margin; A-Softmax adds an angular margin by multiplying the angle; AM-Softmax adds the margin in the cosine domain; and ArcFace adds a fixed additive margin directly in the angular domain.
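For reference (a standard summary of these losses, not specific to our implementation), with $s$ a scale factor, $\theta$ the angle between the query embedding and the candidate vector, and $m$ the margin, the target-class logit under each scheme is:

$$\text{Softmax: } s\cos\theta,\qquad \text{A-Softmax: } s\cos(m\theta),\qquad \text{AM-Softmax: } s(\cos\theta - m),\qquad \text{ArcFace: } s\cos(\theta + m)$$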

We built the index from a corpus of 300,000 queries and ran recall tests with 12,900 online Queries (none of which appeared verbatim in the corpus), using the same vector index tool for all methods. The comparison showed that AM-Softmax and ArcFace greatly improved recall, and they have been applied in our business.
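The text does not name the vector index tool; as one common choice, a Faiss flat inner-product index over normalized vectors looks like this (the random arrays are placeholders for the encoder outputs):

```python
import faiss
import numpy as np

dim = 128
corpus_vecs = np.random.rand(300_000, dim).astype("float32")  # stand-in for encoded corpus queries
query_vecs = np.random.rand(12_900, dim).astype("float32")    # stand-in for encoded online queries

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(corpus_vecs)
faiss.normalize_L2(query_vecs)

index = faiss.IndexFlatIP(dim)    # exact search; IVF/HNSW variants trade accuracy for speed
index.add(corpus_vecs)
scores, ids = index.search(query_vecs, 10)   # top-10 recall candidates per query
```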

For the ranking model, we tried interaction models such as ABCNN, ESIM, and Transformer, but there is still a gap compared with pre-trained models such as BERT. Our team developed the pre-trained model XBert, which is the same scale as RoBERTa Large; it adds our own knowledge-graph data and training tasks such as whole-word masking (WWM), DAE, and entity MLM, and is optimized with the LAMB optimizer. On our business data, XBert improved accuracy by nearly 0.9% over a RoBERTa Large of the same size. To meet online latency requirements, we followed the TinyBERT approach and distilled XBert into a 4-layer Transformer model for online inference.
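A simplified sketch of the distillation objective (TinyBERT additionally aligns hidden states and attention maps layer by layer, which is omitted here; the temperature and weighting are illustrative):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft teacher targets (KL at temperature T) mixed with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale so gradient magnitudes stay comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```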

We compared different ranking schemes on an internal question-answering dataset and ran an end-to-end comparison with the 12,900 real online user queries. Top-1 accuracy of semantic recall was used to evaluate the representation model, and a disambiguation module further improved response accuracy. When evaluating the ranking model, multi-path recall first retrieved 30 candidates in total; the ranking model then sorted them, and the top-1 result was returned as the final answer. If all candidates are filtered out by the disambiguation module, or if the ranking score of the top-1 candidate falls below the reply threshold, the system gives no reply for that query. We therefore use response rate and response accuracy as the final system-level metrics to compare different schemes.
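Putting the pieces together, the reply decision can be sketched as follows; recall_fn, rank_fn, and disambiguate_fn are hypothetical stand-ins for our actual modules, and the threshold value is illustrative:

```python
def answer(query, recall_fn, rank_fn, disambiguate_fn, reply_threshold=0.8):
    """Sketch of the recall -> disambiguate -> rank -> threshold decision flow."""
    candidates = recall_fn(query, top_k=30)          # multi-path recall, 30 candidates in total
    candidates = disambiguate_fn(query, candidates)  # may filter out every candidate
    if not candidates:
        return None                                  # no reply
    best, score = rank_fn(query, candidates)         # top-1 after ranking and its score
    return best if score >= reply_threshold else None
```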

We also evaluated XBert on open semantic similarity datasets. A single XBert model reaches 88.96% accuracy on LCQMC, about 1% higher than a single RoBERTa Large model at 87.9%. Data augmentation based on the transitivity of positive pairs and negative sampling, together with FGM adversarial training, improved this to 89.23%, and ensembling further raised it to 90.47%. In the same way, we reached 87.17% on BQ Corpus, 88% on PAWS-X, and 77.234% on AFQMC, and we also ranked at the top of the Qianyan text similarity competition held by Baidu.

3. Summary and outlook

Short-text similarity has been applied in our chit-chat business. We adopt an architecture of representation-based recall plus interaction-model ranking, which delivers good business results while keeping system performance under control. In the representation model, losses from the face recognition field improve recall; for ranking, large-scale pre-trained models and model distillation further improve business performance. We also actively explore and improve large-scale pre-trained language models: compared with existing open-source models, our XBert achieves better results on both business and public datasets.

In future work, we will continue to make full use of pre-trained models as our most powerful tool, keep optimizing and pushing beyond XBert, and take text similarity matching to a new level. With single-turn similarity matching largely addressed, we will also explore multi-turn matching and context-aware multi-turn generation to further improve the chit-chat experience.