This article introduces MME-CRS, the champion solution of the DSTC10 open-domain dialogue evaluation track. The method designs a diverse set of evaluation metrics and integrates their scores with a correlation re-scaling algorithm, offering a reference for designing more effective metrics in the field of dialogue evaluation. The related work was also published at the AAAI 2022 Workshop. We hope it offers some inspiration or help to readers working in this area.
1 Background
The Dialog System Technology Challenge (DSTC) was launched by scientists from Microsoft and Carnegie Mellon University in 2013. It aims to advance dialogue technology in both academia and industry, and enjoys high authority and popularity in the dialogue field. Now in its 10th edition (DSTC10), the challenge has attracted world-renowned companies, top universities and research institutions such as Microsoft, Amazon, Carnegie Mellon University, Facebook, Mitsubishi Electric Research Laboratories, Meituan and Baidu.
DSTC10 comprises 5 tracks, each containing several sub-tasks in a particular dialogue field. Among them, Track 5 Task 1, Automatic Open-domain Dialogue Evaluation, systematically and comprehensively introduces the automatic evaluation of open-domain dialogue into the DSTC10 competition. Automatic evaluation is an important component of dialogue systems research: it aims to automatically produce dialogue quality scores that agree with human intuition. Compared with manual annotation, which is slow and costly, automatic evaluation can score different dialogue systems efficiently and at low cost, greatly promoting the development of dialogue systems.
Unlike task-oriented dialogue, which has a fixed optimization goal, open-domain dialogue is closer to real human conversation and harder to evaluate, and has therefore attracted extensive attention. The DSTC10 Track 5 Task 1 competition provides 14 validation datasets (37 dialogue evaluation dimensions in total) and 5 test datasets (11 evaluation dimensions in total). The Meituan voice team won first place in the competition with an average correlation of 0.3104. This work is described in the paper "MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue", presented at the AAAI 2022 Workshop.
2 Task introduction
The open-domain dialogue evaluation competition collected classic datasets from papers in the dialogue field, including 14 validation datasets (12 turn-level datasets and 2 dialog-level datasets) and 5 test datasets.
Each conversation in the dataset mainly contains the following information:
- Context: the dialogue history, i.e., the question or preceding turns that the Response replies to.
- Response: the reply to the Context, which is the actual object being evaluated. Responses in the datasets are generally generated by different dialogue generation models, such as GPT-2 and T5.
- Reference: manually written reference answers to the Context, usually around five per dialogue.
Each dialogue is evaluated along multiple dimensions, such as the relevance between the Context and the Response, or the fluency of the Response itself, and each dataset defines its own set of dimensions. The 14 validation sets contain 37 distinct evaluation dimensions in total, including Overall, Grammar, Relevance, Appropriateness and Interesting. Each evaluation dimension carries a manually annotated score on a scale of 1 to 5, where a higher score indicates higher quality along that dimension.
Statistics for validation sets and test sets are shown in Figures 2 and 3:
Turns denotes the number of dialogue turns in each dataset. Qualities lists the evaluation dimensions of each dialogue in the dataset, each of which has a corresponding human-annotated score. Annos denotes the number of annotations in each dataset.
In this competition, every evaluation dimension of every dialogue in every dataset is manually annotated with scores from 1 to 5, which are typically averaged before computing correlations. Teams were asked to design evaluation metrics that predict a score for each evaluation dimension of each dialogue. For every evaluation dimension of every dataset, the Spearman correlation between the predicted scores and the human-annotated scores is computed, and the final competition result is the average over the evaluation dimensions of all test datasets.
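To make the scoring protocol concrete, here is a minimal sketch with made-up numbers of how the Spearman correlation between predicted and human scores is computed for one evaluation dimension of one dataset:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical data: 4 annotators x 6 dialogues, human scores in [1, 5].
human_annotations = np.array([
    [4, 5, 2, 3, 1, 4],
    [5, 4, 2, 3, 2, 5],
    [4, 4, 1, 2, 1, 4],
    [5, 5, 2, 3, 1, 5],
])
human_scores = human_annotations.mean(axis=0)                # average over annotators
predicted_scores = [0.82, 0.90, 0.31, 0.44, 0.12, 0.85]      # metric outputs (any scale works for Spearman)

rho, p_value = spearmanr(predicted_scores, human_scores)
print(f"Spearman correlation: {rho:.4f}")
# The final competition score averages this correlation over all test evaluation dimensions.
```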
3 Existing methods and problems
3.1 Existing methods
Automatic evaluation methods for open domain dialogues fall into three main categories.
Overlap-based methods
Early researchers treated the Reference and Response in a dialogue system like the reference and candidate translations in machine translation, and reused machine translation metrics to evaluate dialogue quality. Overlap-based methods compute the word overlap between the Response and the Reference: the higher the overlap, the higher the score. Classic methods include BLEU [1] and ROUGE [2], where BLEU is precision-oriented and ROUGE is recall-oriented. Because the evaluation of the Response depends entirely on the given Reference while the set of appropriate Responses in open-domain dialogue is unbounded, overlap-based methods are poorly suited to open-domain dialogue evaluation.
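As a concrete illustration with made-up sentences, the sketch below scores two Responses against the References with sentence-level BLEU from NLTK. It also shows the typical failure mode: a perfectly reasonable Response with little word overlap receives a near-zero score.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "i usually go hiking on weekends".split(),
    "i like to hike in the mountains".split(),
]
response_a = "i like hiking in the mountains on weekends".split()            # overlaps the references
response_b = "climbing hills with friends is my favourite pastime".split()   # reasonable reply, no overlap

smooth = SmoothingFunction().method1
print(sentence_bleu(references, response_a, smoothing_function=smooth))  # noticeably higher score
print(sentence_bleu(references, response_b, smoothing_function=smooth))  # close to zero
```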
Embedding-based methods
With the rapid development of word vectors and pretrained language models, embedding-based evaluation methods have achieved good performance. The Response and the Reference are encoded separately by a deep model, and a relevance score is computed from the two encodings. Representative methods include Greedy Matching [3], Embedding Averaging [4] and BERTScore [5-6]. Embedding-based methods improve considerably over overlap-based methods, but they still depend on the Reference, so there remains substantial room for improvement.
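For illustration, here is a minimal sketch of the Embedding Averaging idea; `word_vectors` is a stand-in for real pretrained embeddings such as GloVe or word2vec:

```python
import numpy as np

def sentence_embedding(tokens, word_vectors, dim=300):
    """Average the word vectors of the tokens that have an embedding."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def embedding_average_score(response: str, reference: str, word_vectors) -> float:
    """Cosine similarity between the averaged embeddings of Response and Reference."""
    r = sentence_embedding(response.lower().split(), word_vectors)
    g = sentence_embedding(reference.lower().split(), word_vectors)
    denom = np.linalg.norm(r) * np.linalg.norm(g)
    return float(r @ g / denom) if denom > 0 else 0.0
```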
Learning-based methods
Reference-based open-domain dialogue evaluation suffers from a one-to-many dilemma [7]: the set of appropriate Responses in open-domain dialogue is unbounded, while the manually written References are limited (typically around five). Evaluation methods built on the similarity between the Reference and the Response, whether literal overlap or semantic similarity, are therefore fundamentally limited. In contrast to overlap-based and embedding-based methods, the ADEM method [8] first encodes the Context and the Reference with a hierarchical encoder and then scores the input Response. ADEM optimizes its parameters with the mean squared error between model scores and human scores, aiming to approximate human judgment. ADEM achieved a significant improvement over overlap-based and embedding-based methods, and learning-based methods have gradually become the mainstream approach to automatic open-domain evaluation.
To make dialogue evaluation more accurate and comprehensive, ever more evaluation dimensions have been proposed. To cope with this growing number of dimensions, USL-H [9] groups them into Understandability, Sensibleness and Likeability, as shown in Figure 4. USL-H proposes three metrics, VUP (Valid Utterance Prediction), NUP (Next Utterance Prediction) and MLM (Mask Language Model), which respectively measure:
- whether the Response is fluent and understandable;
- how relevant the Response is to the Context;
- whether the Response itself is specific, human-like, and so on.
3.2 Problems
The existing evaluation methods mainly have the following problems:
The designed metrics are not comprehensive enough to fully measure dialogue quality
Existing automatic evaluation methods mostly focus on a few evaluation dimensions of individual datasets. Taking USL-H, currently one of the more comprehensive methods, as an example: it considers the fluency and richness of the Response and the relevance of Context-Response pairs, but it ignores:
- Finer-grained topic coherence between the Context and the Response.
- The engagement of the speaker producing the Response in the current conversation.
Experiments show that omitting these metrics seriously hurts the performance of evaluation methods. To evaluate multiple dialogue datasets more comprehensively and robustly, it is essential to design metrics that cover more evaluation dimensions.
Lack of effective metric integration methods
Most existing methods design one evaluation metric per evaluation dimension, which does not scale to the growing number of dimensions (the competition validation sets alone contain 37 distinct evaluation dimensions). Moreover, a single dialogue dimension may depend on several metrics. For example, the Logical dimension requires that: 1) the Response is fluent; 2) the Response is relevant to the Context. By designing basic evaluation sub-metrics and integrating their scores with an appropriate integration method, the different evaluation dimensions of a dialogue can be represented more comprehensively and effectively.
4 Our method
To address the lack of comprehensive evaluation metrics, we designed 7 Multi-Metric Evaluation (MME) metrics in 5 categories to measure dialogue quality comprehensively. On top of these 7 basic metrics, we further propose the Correlation Re-Scaling (CRS) method to integrate the scores of the different metrics. The resulting model is called MME-CRS, and its overall architecture is shown in Figure 5:
4.1 Basic metrics
To address the first problem of existing methods, namely that the designed metrics are not comprehensive enough, we designed 7 evaluation sub-metrics in 5 categories for the competition.
4.1.1 Fluency Metric (FM)
Objective: To analyze whether the Response itself is fluent enough to be understood.
Content: A response-fluency dataset is first constructed from the DailyDialog dataset [10] as follows (a code sketch of these rules is given after the list):
- A Response r is randomly sampled from DailyDialog and assigned as a positive or negative sample with probability 0.5 each.
- If r is a positive sample, one adjustment is chosen at random: a. no adjustment; b. delete each stop word with probability 0.5.
- If r is a negative sample, one adjustment is chosen at random: a. randomly shuffle the word order; b. randomly delete a proportion of the words; c. randomly select some words and repeat them.
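A minimal sketch of these construction rules follows; the stop-word list and corruption probabilities are illustrative assumptions rather than the exact values used in the competition:

```python
import random

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it"}  # illustrative list

def make_fluency_sample(response: str):
    """Return (possibly corrupted) text and a label: 1 = fluent, 0 = non-fluent."""
    words = response.split()
    label = int(random.random() < 0.5)
    if label == 1:
        if random.random() < 0.5:                       # b. delete each stop word with probability 0.5
            words = [w for w in words
                     if w.lower() not in STOP_WORDS or random.random() < 0.5]
        # a. otherwise keep the Response unchanged
    else:
        op = random.choice(["shuffle", "delete", "repeat"])
        if op == "shuffle":                             # a. randomly scramble the word order
            random.shuffle(words)
        elif op == "delete":                            # b. randomly delete a proportion of words
            words = [w for w in words if random.random() > 0.3] or words[:1]
        else:                                           # c. randomly select and repeat some words
            repeated = []
            for w in words:
                repeated.append(w)
                if random.random() < 0.3:
                    repeated.append(w)
            words = repeated
    return " ".join(words), label
```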
After the fluency dataset is constructed with the above rules, the pretrained SimCSE model [11] is fine-tuned on it. The fine-tuned model can then compute a fluency score for the Response of any dialogue, which we denote the FM score.
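The snippet below sketches how such a fine-tuned model is used at scoring time: a SimCSE checkpoint is loaded with a binary classification head and the positive-class probability is taken as the FM score. The checkpoint name is an assumption and the fine-tuning loop is omitted; the same fine-tune-then-score pattern is reused for the RM and EM metrics below.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "princeton-nlp/sup-simcse-bert-base-uncased"  # assumed SimCSE checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
# ... fine-tune `model` on the (response, label) pairs constructed above ...
model.eval()

def fm_score(response: str) -> float:
    inputs = tokenizer(response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()   # probability that the Response is fluent
```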
4.1.2 Relevance Metric (RM)
Objective: To analyze the relevance between the Context and the Response.
Content: A Context-Response relevance dataset is constructed from DailyDialog, where relevant sentence pairs are positive samples and irrelevant pairs are negative samples. The usual way to build negative samples is to replace the Response with the Response of a randomly chosen other dialogue. However, the PONE method [12] points out that randomly selected Responses are almost always unrelated to the Context, so the model gains little from training on them. We therefore randomly sample 10 candidate responses, compute their semantic relevance to the ground-truth Response, and select the middle-ranked sentences as pseudo negative samples. After the dataset is constructed, the SimCSE model is fine-tuned on it; the fine-tuned model computes the relevance score of a Context-Response pair, denoted the RM score.
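A sketch of this sampling strategy is shown below; `embed` stands for an assumed sentence encoder (for example the SimCSE model) returning L2-normalized vectors, and the candidate and keep counts are illustrative:

```python
import random
import numpy as np

def sample_pseudo_negatives(true_response, response_pool, embed, n_candidates=10, n_keep=3):
    """Pick middle-ranked candidates: neither trivially unrelated nor near-duplicates."""
    candidates = random.sample(response_pool, n_candidates)
    target = embed(true_response)
    ranked = sorted(candidates, key=lambda c: float(embed(c) @ target))  # by semantic similarity
    mid = len(ranked) // 2
    start = max(mid - n_keep // 2, 0)
    return ranked[start:start + n_keep]
```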
4.1.3 Topic Coherence Metric (TCM)
Objective: To analyze the topic coherence between the Context and the Response.
Content: The GRADE method [13] builds a topic-level graph representation of the Context and the Response and computes their topic-level coherence. Compared with the coarse-grained relevance metric, GRADE focuses on fine-grained topic coherence and is an effective complement to the relevance metric. Our TCM metric follows the GRADE method.
The specific procedure is as follows: first, keywords are extracted from the Context and the Response to build a graph in which each keyword is a node and edges only connect Context keywords to Response keywords. ConceptNet is used to obtain the initial representation of each node, and graph attention networks (GATs) then aggregate information from each keyword's neighbors to iteratively update the node representations. Finally, all node representations are pooled into a graph representation of the dialogue, on top of which a fully connected layer performs topic-level classification. The fine-tuned model can then compute the TCM score of a dialogue.
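The sketch below is an illustrative simplification rather than the GRADE implementation: it performs a single attention-weighted aggregation step over the bipartite Context-Response keyword graph, assuming keyword extraction and initial node embeddings (e.g., from ConceptNet) are provided elsewhere:

```python
import numpy as np

def topic_graph_representation(ctx_emb: np.ndarray, rsp_emb: np.ndarray) -> np.ndarray:
    """ctx_emb: (num_context_keywords, d); rsp_emb: (num_response_keywords, d)."""
    scores = ctx_emb @ rsp_emb.T                    # attention logits on Context-Response edges only
    scores = scores - scores.max()                  # numerical stability
    attn_c = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    attn_r = np.exp(scores.T) / np.exp(scores.T).sum(axis=1, keepdims=True)
    ctx_new = ctx_emb + attn_c @ rsp_emb            # each Context keyword gathers Response-side info
    rsp_new = rsp_emb + attn_r @ ctx_emb            # each Response keyword gathers Context-side info
    # Pool all node representations into a single graph representation; a classification
    # head on top of this vector would produce the TCM score.
    return np.concatenate([ctx_new.mean(axis=0), rsp_new.mean(axis=0)])
```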
4.1.4 Engagement Metric (EM)
Objective: To analyze how willing the person or dialogue model that generated the Response is to engage in the current conversation.
Content: The metrics above all measure dialogue quality from the perspective of the Context and the Response, whereas user engagement is measured from the perspective of the user. Engagement scores typically range from 0 to 5, with higher scores indicating that the user is more interested in continuing the current conversation. We rescaled the engagement scores of the ConvAI dataset [10] from 0-5 to 0-1 to build an engagement dataset. The pretrained model is again SimCSE, fine-tuned to predict engagement; the resulting engagement score of a conversation is denoted EM.
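A minimal sketch of the data and model setup for EM, assuming the same SimCSE checkpoint as above; with `num_labels=1`, Hugging Face models compute an MSE regression loss when labels are provided:

```python
from transformers import AutoModelForSequenceClassification

def rescale_engagement(score: float, low: float = 0.0, high: float = 5.0) -> float:
    """Rescale an engagement label from [low, high] to [0, 1]."""
    return (score - low) / (high - low)

em_model = AutoModelForSequenceClassification.from_pretrained(
    "princeton-nlp/sup-simcse-bert-base-uncased",  # assumed checkpoint
    num_labels=1,                                  # single regression output for the engagement score
)
```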
4.1.5 Specificity Metric (SM)
Objective: To analyze whether the Response itself is detailed enough.
Content: The SM metric penalizes Responses that are vague or lack information.
The specific method is as follows: mask each token of the Response in turn and compute the negative log-likelihood under the MLM task of the SimCSE model; the resulting score is called SM-NLL. Replacing the loss function with the negative cross-entropy or the perplexity yields the SM-NCE and SM-PPL scores, respectively, giving three SM scores in total. Each of the three SM scores is normalized to the range 0 to 1.
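The sketch below illustrates how SM-NLL and SM-PPL can be computed by masking each token in turn and scoring the original token with a masked language model; `bert-base-uncased` is used here as a stand-in for the SimCSE-based MLM, and the final normalization to the 0-1 range is omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MLM_NAME = "bert-base-uncased"  # stand-in for the SimCSE-based MLM used in the method
tokenizer = AutoTokenizer.from_pretrained(MLM_NAME)
mlm = AutoModelForMaskedLM.from_pretrained(MLM_NAME)
mlm.eval()

def sm_scores(response: str):
    ids = tokenizer(response, return_tensors="pt")["input_ids"][0]
    nlls = []
    for pos in range(1, len(ids) - 1):                     # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id              # mask one token at a time
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[pos]].item())           # NLL of the original token
    sm_nll = sum(nlls) / max(len(nlls), 1)                 # SM-NLL: average negative log-likelihood
    sm_ppl = float(torch.exp(torch.tensor(sm_nll)))        # SM-PPL: pseudo-perplexity
    return sm_nll, sm_ppl
```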
4.2 Integration method CRS
Integrating the scores of different evaluation metrics is an effective way to improve automatic dialogue evaluation.
For each dialogue to be evaluated, the 7 basic metrics in 5 categories above produce 7 different scores. For each evaluation dimension of each dataset, these 7 scores are integrated into a single composite score, which is then correlated with the human score. Our integration approach has two steps.
4.2.1 Computing the weight distribution for each evaluation dimension
First, the correlation score of each of the 7 metrics is computed on every evaluation dimension of every dataset in the validation set. The higher the correlation score, the more important the metric is for that evaluation dimension. More important metrics are given larger weights, and the weights are then re-normalized along the metric axis, yielding a weight distribution over the metrics for each evaluation dimension of each dataset:
$$w_{ijk} = \frac{S_{ijk}^{\,d_{ij}}}{\sum_{k} S_{ijk}^{\,d_{ij}}}$$

Here $S_{ijk}$ is the correlation score of the $k$-th metric on the $j$-th evaluation dimension of the $i$-th dataset, and $d_{ij}$ is the power applied to the correlation score; the larger $d_{ij}$ is, the more weight is given to metrics with higher correlation scores. In general, the integration works best when the largest re-normalized weight $\max_k w_{ijk}$ lies between 1/3 and 1/2, which provides a simple and effective criterion for choosing $d_{ij}$. In our experiments, $d_{ij}$ is set to the constant 2 for better generalization; the weight distribution is computed on the validation set and then transferred to the test set, achieving the best performance in the competition.
Along the dataset axis, the weights for the same evaluation dimension are then averaged across datasets, giving the weight distribution of each evaluation dimension over the metrics: $w_{jk} = \frac{1}{N_j}\sum_i w_{ijk}$, where $N_j$ is the number of datasets that contain dimension $j$.
Note that the resulting weight distribution no longer depends on any specific dataset, so it can be transferred directly to the test set.
4.2.2 Computing the weighted sum of metric scores
For each evaluation dimension of each test set, the 7 metric scores of a dialogue are computed and combined using the weights from the first step to obtain the composite score $\mathrm{score}_j = \sum_{k} w_{jk}\, m_k$, where $m_k$ is the score of the $k$-th metric.
The correlation between this weighted composite score and the human score is then computed, giving the correlation between model scores and human scores on each evaluation dimension.
Because the integration weights are obtained by re-scaling and re-normalizing the correlation scores of the metrics, we call this integration method Correlation Re-Scaling (CRS). Applying CRS to the MME metrics gives the MME-CRS evaluation algorithm.
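The following numpy sketch summarizes the two CRS steps under the assumptions above; clipping negative correlations to zero and masking missing dimensions with NaN are implementation choices of the sketch, not details stated in the text:

```python
import numpy as np

def crs_weights(corr: np.ndarray, d: float = 2.0) -> np.ndarray:
    """corr[i, j, k]: validation correlation of metric k on dimension j of dataset i
    (NaN where a dataset lacks a dimension). Returns weights of shape (n_dims, n_metrics)."""
    powered = np.clip(corr, 0.0, None) ** d                              # re-scale correlations (d_ij = 2)
    per_dataset = powered / np.nansum(powered, axis=-1, keepdims=True)   # re-normalize over the 7 metrics
    weights = np.nanmean(per_dataset, axis=0)                            # average over datasets
    return weights / weights.sum(axis=-1, keepdims=True)                 # final per-dimension normalization

def crs_score(metric_scores: np.ndarray, dim_weights: np.ndarray) -> float:
    """Weighted sum of the 7 metric scores of one dialogue for one evaluation dimension."""
    return float(metric_scores @ dim_weights)
```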
5 Experimental Analysis
5.1 Experimental Results
Our method was pretrained mainly on the DailyDialog dataset (except the EM sub-metric, which uses the ConvAI2 dataset), and the weight distribution of the integration method was computed on the competition validation set. It achieved a final Spearman correlation of 0.3104 on the test set.
Figure 6 shows the performance of the baseline model Deep AM-FM [14] and of the Top 5 teams on the evaluation dimensions of each test dataset. Our method ranked first with an average Spearman correlation of 0.3104 and placed first on 6 of the 11 evaluation dimensions across the 5 datasets, demonstrating its superior performance.
For readability, each column in the figure is labeled as dataset-dimension. J, E, N, DT and DP stand for the JSALT, ESL, NCM, DSTC10-Topical and DSTC10-Persona datasets, respectively; A, C, G and R stand for Appropriateness, Content, Grammar and Relevance. The best performance on each evaluation dimension is highlighted.
5.2 Ablation experiment
In the ablation study, we use the full MME-CRS evaluation as the baseline and remove the FM, RM, TCM, EM, SM and RM+TCM metrics from the integration stage to compare the importance of the different metrics during integration. The results are shown in Figure 7:
The relevance metric RM and the topic coherence metric TCM both use the Context and Response of a dialogue, so we also remove the two together to observe the impact on performance. From the results in Figure 7 we can see that:
- TCM, RM and EM contribute the most to model performance: removing them from the score integration stage lowers the average Spearman correlation on the test set by 3.26%, 1.56% and 1.01%, respectively.
- The coarse-grained RM metric and the fine-grained TCM metric complement each other. Removing either RM or TCM alone degrades performance only slightly, but removing both deprives the evaluation method of Context-Response relevance information, and the performance drops by 11.07%.
- The improvement from the SM metric on the test set is negligible. Our analysis is that the generation models used to produce the Responses of the test set were severely overfitted on their corpora, generating many highly specific Responses that are unrelated to the Context. As a result, the SM metric contributes little to evaluation quality on the test set.
5.3 Effect of CRS
To analyze the contribution of the CRS integration algorithm, we compare MME-CRS with MME-AVG (which simply averages the scores of the MME metrics), as shown in Figure 8:
As the figure shows, MME-CRS outperforms MME-AVG by 3.49%, demonstrating the superiority of the CRS algorithm for integrating sub-metric scores.
6 Summary
In this competition, we identified two main problems in automatic open-domain dialogue evaluation: evaluation metrics that are not comprehensive enough, and the lack of effective metric integration methods. To address the first, we designed 7 evaluation metrics in 5 categories to measure dialogue quality comprehensively. On top of these 7 basic metrics, we proposed the Correlation Re-Scaling method to compute an integrated score for each dialogue evaluation dimension.
Although our method achieved good results in the DSTC10 competition, we will continue to explore more effective evaluation metrics and integration methods. We are also applying the techniques from the competition to Meituan's businesses, such as the intelligent outbound robot, intelligent marketing and intelligent customer service in the voice interaction center, where we evaluate the quality of conversations between machines, human agents and users along different dimensions, continuously improving dialogue quality and user satisfaction.
References
[1] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311–318.
[2] Lin C Y. Rouge: A package for automatic evaluation of summaries[C]//Text summarization branches out. 2004: 74-81.
[3] Rus, V.; and Lintean, M. 2012. An optimal assessment of natural language student input using word-to-word similarity metrics. In International Conference on Intelligent Tutoring Systems, 675–676. Springer.
[4] Wieting, J.; Bansal, M.; Gimpel, K.; and Livescu, K. 2016. Towards universal paraphrastic sentence embeddings. In 4th International Conference on Learning Representations.
[5] Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
[6] Liu C W, Lowe R, Serban I V, et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016: 2122-2132.
[7] Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 654–664.
[8] Lowe R, Noseworthy M, Serban I V, et al. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017: 1116-1126.
[9] Phy, V.; Zhao, Y.; and Aizawa, A. 2020. Deconstruct to reconstruct a configurable evaluation metric for open-domain dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, 4164–4178.
[10] Zhao, T.; Lala, D.; and Kawahara, T. 2020. Designing precise and robust dialogue response evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 26–33.
[11] Gao T, Yao X, Chen D. SimCSE: Simple Contrastive Learning of Sentence Embeddings[J]. arXiv preprint arXiv:2104.08821, 2021.
[12] Lan, T.; Mao, X.-L.; Wei, W.; Gao, X.; and Huang, H. 2020. Pone: A novel automatic evaluation metric for open-domain generative dialogue systems. ACM Transactions on Information Systems (TOIS), 39 (1) : 1-37.
[13] Huang, L.; Ye, Z.; Qin, J.; Lin, L.; and Liang, X. 2020. GRADE: Automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9230–9240.
[14] Zhang, C.; D'Haro, L. F.; Banchs, R. E.; Friedrichs, T.; and Li, H. 2021. Deep AM-FM: Toolkit for automatic dialogue evaluation. In Conversational Dialogue Systems for the Next Decade, 53–69. Springer.
About the authors
Peng Fei, Xiao Hui, Kai Dong, Wang Jian, Chunyang and others are engineers in the Meituan Platform/Voice Interaction Department.