Although BERT-based models have achieved success on many downstream NLP tasks, the sentence representations derived directly from BERT are confined to a very small region of the embedding space and exhibit high pairwise similarity, which makes them hard to use directly for text semantic matching. To address this “collapse” of BERT’s native sentence representations, the Knowledge Graph team of Meituan’s NLP Center proposed ConSERT, a sentence representation transfer method based on contrastive learning, which fine-tunes BERT on an unlabeled corpus from the target domain so that the generated sentence representations better fit the data distribution of downstream tasks. On the sentence semantic matching (STS) task, ConSERT achieves a relative improvement of 8% over the previous SOTA under the same settings, and it still delivers a strong performance gain in few-shot scenarios.
- ConSERT: A Contrastive Framework for Self-supervised Sentence Representation Transfer
- Conference: ACL 2021
- Download link: arxiv.org/abs/2105.11…
- Open source: github.com/yym6472/Con…
1. Background
Sentence representation learning plays an important role in natural language processing (NLP), and the success of many NLP tasks depends on training high-quality sentence representation vectors. In particular, on Semantic Textual Similarity and Dense Text Retrieval tasks, the model encodes two sentences, measures their semantic relatedness by computing the similarity of their embeddings, and then determines their matching score.
Although BERT-based models achieve good performance on many NLP tasks (via supervised fine-tuning), the sentence vectors derived from BERT itself (without fine-tuning, by averaging all word vectors) are of low quality, even inferior to GloVe, and can hardly reflect the semantic similarity of two sentences [1][2][3][4]. In our research, we further analyzed the characteristics of BERT-derived sentence vectors and confirmed the following two points:
- BERT tends to encode all sentences into a small region of the space, which results in high similarity scores for most sentence pairs, even those that are semantically completely unrelated (as shown in Figure 1A). We refer to this as the “collapse” phenomenon of BERT’s sentence representations.
- The collapse of BERT’s sentence representations is related to the high-frequency words in a sentence. Specifically, when a sentence vector is computed by averaging word vectors, the vectors of high-frequency words dominate the sentence vector, making it hard to reflect the sentence’s original meaning. When the top few high-frequency words are removed before averaging, the collapse is alleviated to some extent (the blue curve in Figure 2); a minimal sketch of this averaging procedure is given below.
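As a minimal sketch (our own illustration, not the paper’s code) of this sentence-vector computation: average BERT’s last-layer token vectors, optionally skipping a list of high-frequency tokens. The checkpoint name and the stop-token list are assumptions.

```python
# Sketch: average-pooled BERT sentence vectors, optionally dropping
# high-frequency tokens before averaging (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

HIGH_FREQ = {"the", "a", "an", "of", "to", ".", ","}  # hypothetical high-frequency list

def sentence_vector(text, remove_high_freq=False):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]              # (seq_len, hidden)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    keep = [
        i for i, tok in enumerate(tokens)
        if not (remove_high_freq and tok in HIGH_FREQ)
    ]
    return hidden[keep].mean(dim=0)                              # average pooling

v1 = sentence_vector("a man is cutting paper with a sword .")
v2 = sentence_vector("a woman is cutting a tomato .")
print(float(torch.cosine_similarity(v1, v2, dim=0)))
```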
Sentence vectors derived from BERT are therefore hard to use directly for downstream semantic matching tasks, and labeled corpora for supervised fine-tuning are expensive. We thus looked for a self-supervised method that only requires collecting a small amount of unlabeled text from the downstream task for fine-tuning, so as to solve the “collapse” problem of BERT sentence vectors and make the representations better fit the downstream task.
In this paper, we use Contrastive Learning to achieve this. Contrastive learning is one of the most widely used self-supervised tasks today. Its core idea is that, just as humans distinguish objects by “contrast”, similar things should be close in the representation space and different things should be as far apart as possible. By applying different data augmentations to the same sample, we obtain “self-similar” text pairs as positive examples and treat the other texts in the same batch as negative examples; these serve as the supervision signal that regularizes BERT’s representation space. In our experiments, we found that contrastive learning effectively eliminates the interference of high-frequency words with the semantic representation of a sentence (the orange curve in Figure 2). After contrastive training, the sentence representations produced by the model are no longer dominated by high-frequency words (removing the top few high-frequency words no longer changes performance significantly). This is because the self-discrimination objective of contrastive learning naturally recognizes and suppresses such high-frequency features, preventing semantically different sentences from being represented too similarly (i.e., the collapse phenomenon).
Under the contrastive learning framework, we further analyze the effect of different data augmentation methods and verify the performance of our method with few samples. Experimental results show that even with a very limited amount of data (e.g., 1,000 unlabeled samples), our method remains robust: it effectively mitigates the collapse of BERT’s representation space and improves the metrics on downstream semantic matching tasks.
2. Research status and related work
2.1 Sentence representation learning
Sentence representation learning is a classic task, which is divided into the following three stages:
- Supervised sentence representation learning: Early work [5] found that the Natural Language Inference (NLI) task is very helpful for semantic matching; it trained a BiLSTM encoder on the combination of the SNLI and MNLI datasets. Universal Sentence Encoder (USE) [6] uses a Transformer-based architecture and augments unsupervised training with SNLI. SBERT [1] further encodes the two sentences with a shared pre-trained BERT encoder fine-tuned on the NLI datasets.
- Self-supervised sentence-level pre-training: Since annotating supervised data is expensive, researchers began to look for unsupervised training methods. BERT proposed NSP as a self-supervised sentence-level pre-training objective, although later work showed that NSP adds little on top of MLM. Cross-Thought [7] and CMLM [8] are two similar pre-training objectives: they cut a passage into several short sentences and recover the masked tokens of the current sentence by encoding the adjacent sentences. Compared with MLM, the encodings of the surrounding sentences are used to help token recovery, so these objectives are better suited to sentence-level training. SLM [9] performs self-supervised pre-training by shuffling several originally coherent sentences (implemented by changing the position ids) and then predicting the correct sentence order.
- Unsupervised sentence representation transfer: Pre-trained models are now widely used, but the representations produced by BERT’s NSP task perform poorly, and most teams do not have the resources to run self-supervised pre-training themselves, so it is more practical to transfer a pre-trained model’s representations to the target task. BERT-flow [2] (CMU and ByteDance AI Lab) learns an invertible flow transformation on top of BERT that maps BERT’s representation space to a standard Gaussian space, and then performs similarity matching in that Gaussian space. BERT-whitening [10] is concurrent work by Su Jianlin et al.; they show that whitening the BERT representations (zero mean, identity covariance) achieves an effect comparable to BERT-flow on STS. SimCSE [11] is work from Danqi Chen’s group published in April 2021, after our ACL submission in February; it also uses a contrastive training framework, with Dropout as the data augmentation, and fine-tunes BERT on a Wikipedia corpus.
2.2 Contrastive learning
Contrastive learning is a pre-training method that emerged in the CV field at the end of 2019 and has recently been widely applied to NLP tasks as well. We briefly review the progress in both areas:
- Contrastive learning in computer vision (CV): From the end of 2019 to the beginning of 2020, Facebook proposed MoCo [14] and Google proposed SimCLR [15], after which contrastive learning began to shine in unsupervised image representation pre-training. SimCLR proposed a simple contrastive learning framework: the same image is augmented into two different versions, each version is encoded with a ResNet, mapped into a contrastive learning space by a projection layer, and pre-trained with the NT-Xent loss. The framework in this paper is mainly inspired by SimCLR.
- Contrastive learning in NLP (for text representation learning): With the great success of contrastive learning in unsupervised image representation pre-training, many works have tried to introduce contrastive learning into NLP language model pre-training. The table below summarizes some representative works:
| Name | Structure | Data augmentation | Training stage |
|---|---|---|---|
| BERT-CT | Two models with the same architecture but different parameters | The two models encode the same sentence to produce the two views | Fine-tuning BERT |
| IS-BERT | An additional CNN layer on top of BERT | The global embedding as one view; the local embeddings from the CNN layer as the other view | Fine-tuning BERT to maximize the mutual information between the global and local embeddings, similar to DeepInfoMax (DIM) |
| CERT | Similar to MoCo, with a momentum encoder | Back-translation | Unsupervised contrastive fine-tuning on the task first, then transfer to the supervised task |
| CLEAR | Similar to SimCLR | Token and span deletion, position reordering, substitution | Pre-training, with MLM and the contrastive loss trained jointly |
| DeCLUTR | Similar to SimCLR | Extracting spans from the text | Pre-training, with MLM and the contrastive loss trained jointly |
3. Model introduction
3.1 Problem Definition
Given a BERT-like pre-trained language model $\textbf{M}$ and an unlabeled text corpus $\mathcal{D}$ collected from the target-domain data distribution, we want to construct a self-supervised task to fine-tune $\textbf{M}$ on $\mathcal{D}$ so that the fine-tuned model performs as well as possible on the target task (text semantic matching).
3.2 Sentence representation transfer framework based on contrastive learning
As shown in Figure 3, inspired by SimCLR, we build on the BERT encoder and propose ConSERT, which consists of three main parts:
- A data augmentation module (described below) that acts on the embedding layer and generates two different augmented views of the same sentence.
- A shared BERT encoder that generates sentence vectors for input sentences.
- A contrastive loss layer that computes the contrastive loss within a batch of samples. The idea is to maximize the similarity between the sentence vectors of the two augmented views of the same sample, while pushing the sentence vectors of different samples apart.
During training, a batch of texts is sampled from the dataset $\mathcal{D}$ with batch size $N$. The data augmentation module generates two versions of each sample using two preset augmentation methods, giving $2N$ samples in total. All $2N$ samples are encoded by the shared BERT encoder, and $2N$ sentence vectors are obtained through an average pooling layer. We fine-tune the model with the NT-Xent loss used in SimCLR:

$$\mathcal{L}_{i,j} = -\log \frac{\exp\left(\text{sim}(r_i, r_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\text{sim}(r_i, r_k)/\tau\right)}$$

Here $\text{sim}(\cdot)$ is the cosine similarity function, $r$ denotes the corresponding sentence vector, and $\tau$ is the temperature, a hyperparameter set to 0.1 in our experiments. Intuitively, for each sample in the batch, this loss asks the model to find its other augmented version, while the remaining $2N-2$ samples in the batch act as negatives. Optimizing it makes the two augmented versions of the same sample as consistent as possible in the representation space, while pushing them as far as possible from the other negatives within the batch. A minimal code sketch of this loss follows.
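For concreteness, here is a minimal sketch of the NT-Xent loss described above (our own illustration, not the released ConSERT code); it assumes the $2N$ sentence vectors are arranged so that rows $2i$ and $2i+1$ are the two augmented views of sample $i$:

```python
# Sketch: NT-Xent loss over 2N sentence vectors (rows 2i and 2i+1 are the two
# augmented views of sample i). Illustrative, not the released ConSERT code.
import torch
import torch.nn.functional as F

def nt_xent_loss(reps, temperature=0.1):
    two_n = reps.size(0)
    reps = F.normalize(reps, dim=-1)
    sim = reps @ reps.t() / temperature              # cosine similarities / tau
    sim.fill_diagonal_(float("-inf"))                # exclude self-similarity
    # index of the positive (the other view of the same sample) for each row
    targets = torch.arange(two_n, device=reps.device) ^ 1
    return F.cross_entropy(sim, targets)

# Usage sketch:
# reps = encoder(augmented_batch)        # shape (2N, hidden), average-pooled
# loss = nt_xent_loss(reps, temperature=0.1)
```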
3.3 Exploring data augmentation methods for text
In the image domain it is easy to transform a sample (rotation, flipping, cropping, color removal, blurring, and so on) to obtain an augmented version. However, because of the inherent complexity of language, it is hard to find efficient data augmentations that preserve semantics. Some ways to explicitly generate augmented samples include:
- Back-translation: translate the text into another language and back using machine translation models.
- CBERT [12][13]: replace some of the words in the text with [MASK] and then use BERT to predict the masked positions, generating augmented sentences.
- Paraphrasing: use a well-trained paraphrase generation model to produce synonymous sentences.
However, on the one hand these methods cannot guarantee semantic consistency; on the other hand, each augmentation requires an extra model inference, which is costly. We therefore consider implicitly generating augmented samples at the embedding level, as shown in Figure 4:
- Adversarial Attack: generate an adversarial perturbation via gradient backpropagation and add it to the original embedding matrix to obtain the augmented sample. Since generating the perturbation requires gradients from a supervised loss, this augmentation is only applicable to supervised training.
- Token Shuffling: scramble the word order of the input sample. Since the Transformer architecture has no inherent notion of “position”, the model’s perception of token positions comes entirely from the position ids in the embedding layer, so it suffices to shuffle the position ids.
- Cutoff: can be further divided into two types:
  - Token Cutoff: randomly select tokens and set their entire embedding rows to zero.
  - Feature Cutoff: randomly select feature dimensions of the embedding and set those dimensions to zero for all tokens.
- Dropout: set every element of the embedding matrix to zero independently with some probability; unlike Cutoff, there is no row or column constraint.
These four methods only require simple modifications to the embedding matrix (or BERT’s position ids), so they are much more efficient than explicitly generating augmented text. A minimal sketch of the four augmentations follows.
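The sketch below is our own simplified illustration of these embedding-level augmentations; the cutoff/dropout rates and tensor shapes are assumptions, and the released implementation may differ in details.

```python
# Sketch of the four embedding-level augmentations (illustrative, simplified).
# embeddings: (batch, seq_len, hidden) output of BERT's embedding layer;
# position_ids: (batch, seq_len). The rates are assumed values.
import torch

def token_shuffle(position_ids):
    """Shuffle position ids so the Transformer 'sees' a scrambled word order."""
    perm = torch.randperm(position_ids.size(-1))
    return position_ids[..., perm]

def token_cutoff(embeddings, rate=0.15):
    """Zero out the whole embedding row of randomly chosen tokens."""
    mask = (torch.rand(embeddings.size(0), embeddings.size(1), 1) > rate).float()
    return embeddings * mask

def feature_cutoff(embeddings, rate=0.15):
    """Zero out randomly chosen feature dimensions across all tokens."""
    mask = (torch.rand(1, 1, embeddings.size(2)) > rate).float()
    return embeddings * mask

def embedding_dropout(embeddings, rate=0.15):
    """Zero out individual elements with probability `rate`, no row/column constraint."""
    mask = (torch.rand_like(embeddings) > rate).float()
    return embeddings * mask
```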
3.4 Further fusing supervised signals
In addition to unsupervised training, we also propose several strategies to further fuse supervised signals:
- Joint training (joint): train the model on a weighted sum of the supervised and unsupervised losses (see the sketch after this list).
- Supervised then unsupervised (sup-unsup): first train the model with the supervised loss, then use the unsupervised method for representation transfer.
- Joint then unsupervised (joint-unsup): first train the model with the joint loss, then use the unsupervised method for representation transfer.
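As a minimal illustration of the joint strategy (the loss weight below is a hypothetical hyperparameter, not a value from the paper):

```python
# Sketch: one joint training objective combining a supervised loss (e.g., NLI
# classification) with the unsupervised contrastive loss. `alpha` is assumed.
def joint_loss(supervised_loss, contrastive_loss, alpha=1.0):
    return supervised_loss + alpha * contrastive_loss

# loss = joint_loss(nli_cross_entropy, nt_xent_loss(reps, temperature=0.1))
# loss.backward(); optimizer.step()
```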
4. Experimental analysis
We mainly experimented on the Semantic Textual Similarity (STS) task, covering seven datasets: STS12, STS13, STS14, STS15, STS16, STSb, and SICK-R. STS12-16 are the datasets released for the SemEval 2012-2016 evaluations. STSb is the STS Benchmark from the SemEval 2017 evaluation. SICK-R stands for SICK-Relatedness, a subtask of the SICK (Sentences Involving Compositional Knowledge) dataset whose goal is to infer the semantic relatedness of two sentences. Each sample in these datasets contains two short texts, Text1 and Text2, along with a human-annotated score between 0 and 5 indicating how well Text1 and Text2 match semantically (5 means the best match, i.e., “the two sentences express the same meaning”; 0 means the worst match, i.e., “the semantics of the two sentences are completely unrelated”). Here are two examples:
| Text1 | Text2 | Human-annotated semantic similarity |
|---|---|---|
| a man is cutting paper with a sword . | a woman is cutting a tomato . | 0.6 |
| a woman is dancing in the rain . | a woman dances in the rain out side . | 5.0 |
Following previous work [1][2], we use the Spearman correlation as the evaluation metric; it measures the correlation between two sets of values (the cosine similarities predicted by the model and the human-annotated semantic similarities). The result lies in [-1, 1] and reaches 1 only when the two sets of values are perfectly positively correlated. For each dataset we compute this metric over all of its test samples and report the average over the seven datasets. For readability, the tables report 100 times the result. A minimal sketch of this evaluation follows.
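The snippet below is a minimal sketch of this evaluation (function and variable names are our own, not from the released code), using SciPy’s Spearman correlation:

```python
# Sketch: Spearman correlation between model-predicted cosine similarities and
# the human-annotated scores; the tables report 100x this value.
from scipy.stats import spearmanr

def sts_spearman(predicted_sims, gold_scores):
    """predicted_sims: cosine similarities from the model;
    gold_scores: human-annotated similarity scores in [0, 5]."""
    rho, _ = spearmanr(predicted_sims, gold_scores)
    return 100.0 * rho

# Example: sts_spearman([0.31, 0.92], [0.6, 5.0])
```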
4.1 Unsupervised experiment
In the unsupervised experiments, we fine-tuned the pre-trained BERT directly on the unlabeled STS data. The results show that under exactly the same settings, our method greatly outperforms the previous SOTA, BERT-flow, with a relative performance improvement of 8%.
4.2 Supervised experiments
In the supervised experiments, we used additional training data from SNLI and MNLI and evaluated the three strategies described above for fusing supervised signals. The results show that our method exceeds the baselines under both settings: “labeled NLI data only” and “labeled NLI data + unlabeled STS data”. Among the three fusion strategies, joint-unsup achieves the best results.
4.3 Analysis of different data augmentation methods
We performed an ablation over combinations of data augmentation methods; the results are shown in Figure 7. The combination of Token Shuffle and Feature Cutoff achieves the best performance (72.74). For single augmentation methods, Token Shuffle > Token Cutoff >> Feature Cutoff ≈ Dropout >> None.
4.4 Experimental analysis under small sample setting
We further analyzed the influence of the amount of data (the number of unlabeled texts) on the results, shown in Figure 8. Our method needs only a small number of samples to approach the full-data performance, and even with very few samples (e.g., 100 texts) it still shows a clear improvement over the baseline.
4.5 Analysis of the temperature hyperparameter
In our experiments, we found that the temperature hyperparameter $\tau$ in the contrastive loss has a large influence on the results. The analysis in Figure 9 shows that the best results are obtained when $\tau$ is between 0.08 and 0.12. This again reflects the collapse of BERT’s representations: because the sentence representations are all very close, an overly large $\tau$ makes the similarities between sentences even smoother, so the encoder can hardly learn anything; while an overly small $\tau$ makes the task too easy. Therefore $\tau$ needs to be tuned into a suitable range. The sketch below illustrates this effect.
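As a small illustration (the similarity values below are hypothetical, not measured), the following snippet shows how $\tau$ controls the sharpness of the softmax over a set of near-collapsed cosine similarities:

```python
# Illustration: how the temperature tau sharpens or flattens the softmax over
# cosine similarities (values here are hypothetical, near-collapsed scores).
import torch

sims = torch.tensor([0.95, 0.90, 0.88])
for tau in (1.0, 0.1, 0.01):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {[round(p, 3) for p in probs.tolist()]}")
# A large tau leaves the distribution almost uniform (little learning signal when
# all similarities are close); a small tau sharpens the differences between them.
```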
4.6 Analysis of the batch size hyperparameter
In contrastive learning on images, batch size has a large impact on the results, so we also compared the model under different batch sizes. Figure 10 shows that performance grows roughly with batch size, but the improvement is very limited.
5. Summary
In this work, we analyze the cause of the collapse of BERT’s sentence representation space and propose ConSERT, a sentence representation transfer framework based on contrastive learning. ConSERT performs well both in unsupervised fine-tuning and when supervised signals are further fused, and it retains a clear performance gain even when only a small number of samples can be collected, showing strong robustness.
Meanwhile, many of Meituan’s business scenarios across different domains require short-text relatedness computation. ConSERT has already been applied to knowledge graph construction, KBQA, search recall, and other business scenarios, and we will explore and deploy it in more Meituan businesses in the future. The code is open-sourced on GitHub; you are welcome to use it.
References
- [1] Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
- [2] Li, Bohan, et al. “On the Sentence Embeddings from Pre-trained Language Models.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
- [3] Gao, Jun, et al. “Representation Degeneration Problem in Training Natural Language Generation Models.” International Conference on Learning Representations. 2018.
- [4] Wang, Lingxiao, et al. “Improving Neural Language Generation with Spectrum Control.” International Conference on Learning Representations. 2019.
- [5] Conneau, Alexis, et al. “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.
- [6] Cer, Daniel, et al. “Universal Sentence Encoder for English.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018.
- [7] Wang, Shuohang, et al. “Cross-Thought for Sentence Encoder Pre-training.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
- [8] Yang, Ziyi, et al. “Universal Sentence Representation Learning with Conditional Masked Language Model.” arXiv preprint arXiv:2012.14388 (2020).
- [9] Lee, Haejun, et al. “SLM: Learning a Discourse Language Representation with Sentence Unshuffling.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
- [10] Su, Jianlin, et al. “Whitening sentence representations for better semantics and faster retrieval.” arXiv preprint arXiv:2103.15316 (2021).
- [11] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv preprint arXiv:2104.08821 (2021).
- [12] Wu, Xing, et al. “Conditional bert contextual augmentation.” International Conference on Computational Science. Springer, Cham, 2019.
- [13] Zhou, Wangchunshu, et al. “BERT-based lexical substitution.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
- [14] He, Kaiming, et al. “Momentum contrast for unsupervised visual representation learning.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
- [15] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” International conference on machine learning. PMLR, 2020.
- [16] Zhang, Yan, et al. “An Unsupervised Sentence Embedding Method by Mutual Information Maximization.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
- [17] Fang, Hongchao, et al. “Cert: Contrastive self-supervised learning for language understanding.” arXiv preprint arXiv:2005.12766 (2020).
- [18] Carlsson, Fredrik, et al. “Semantic re-tuning with contrastive tension.” International Conference on Learning Representations. 2021.
- [19] Giorgi, John M., et al. “DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.” arXiv preprint arXiv:2006.03659 (2020).
- [20] Wu, Zhuofeng, et al. “CLEAR: Contrastive Learning for Sentence Representation.” arXiv preprint arXiv:2012.15466 (2020).
About the authors
- Yuan Meng, Ru Mei, Si Rui, Fu Zheng, Wu Wei: Meituan Platform / Search and NLP Department.
- Weiran Xu: associate professor and doctoral supervisor, Pattern Recognition Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications.
This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, provided you credit “Content reproduced from the Meituan technical team”. This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to request authorization.