BLEU and ROUGE are two of the most commonly used evaluation metrics for machine translation. BLEU measures translation quality based on Precision, while ROUGE measures it based on Recall.
1. Machine translation evaluation metrics
After a translation has been generated by a machine learning model, its quality needs to be evaluated, which requires machine translation evaluation metrics; BLEU and ROUGE are the most common ones. Both metrics have a fairly long history: BLEU was introduced in 2002 and ROUGE in 2003. Although they have known shortcomings, they remain the mainstream evaluation metrics for machine translation.
Generally, C denotes the candidate translation produced by the machine, and m reference translations S1, S2, …, Sm are given. An evaluation metric measures how well the machine translation C matches the references S1, S2, …, Sm.
2. BLEU
The full name of BLEU is Bilingual Evaluation Understudy. The BLEU score ranges from 0 to 1; the closer it is to 1, the higher the translation quality. BLEU is mainly based on Precision. The overall BLEU formula is:

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )
- BLEU computes the precision of the translation's 1-grams, 2-grams, …, N-grams; N is usually set to 4. p_n in the formula is the n-gram precision.
- w_n is the weight of each n-gram precision, usually set uniformly, i.e. w_n = 1/N for every n.
- BP is the penalty factor (brevity penalty). If the translation is shorter than the reference, BP is less than 1, penalizing translations that are too short.
- The 1-gram precision reflects how faithful the translation is to the source text, while the higher-order n-gram precisions reflect how fluent the translation is.
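Note that with uniform weights, the exponential of the weighted log-precisions is simply a geometric mean, so the formula above can equivalently be written as:

BLEU = BP · (p_1 · p_2 · … · p_N)^{1/N}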
2.1 N-gram precision calculation
Suppose the machine translation C and a reference translation S1 are as follows:
C: a cat is on the table
S1: there is a cat on the table
The 1-gram to 4-gram precisions of C are then: p1 = 6/6 = 1, p2 = 3/5, p3 = 1/4, p4 = 0/3.
There are some problems with calculating Precision directly like this, for example:
C: there there there there there
S1: there is a cat on the table
In this case the machine translation is obviously poor, yet its 1-gram Precision is 1, so BLEU uses a modified (clipped) Precision. Given references S1, S2, …, Sm, the modified n-gram Precision of C is:

p_n = Σ_{n-gram ∈ C} Count_clip(n-gram) / Σ_{n-gram ∈ C} Count(n-gram)

where Count(n-gram) is the number of times the n-gram occurs in C, and Count_clip(n-gram) clips that count at the maximum number of times the n-gram occurs in any single reference. For the example above, Count_clip("there") = 1, so the 1-gram Precision drops from 1 to 1/5.
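A minimal sketch of this clipped n-gram precision in Python (the function and variable names here are illustrative, not from any particular library):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: counts in the candidate are capped by the
    maximum count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

# "there there there there there" vs. "there is a cat on the table":
# the clipped count of "there" is 1, so p1 = 1/5 instead of 1.
print(modified_precision("there there there there there".split(),
                         ["there is a cat on the table".split()], 1))  # 0.2
```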
2.2 Penalty factor
The calculation of BLEU's n-gram precision was introduced above, but a problem remains: when the machine translation is very short, its precision (and hence its BLEU score) can be high even though the translation loses a great deal of information, for example:
C: a cat
S1: there is a cat on the table
Therefore the precision-based score is multiplied by a penalty factor BP:

BP = 1 if c > r, and BP = exp(1 − r/c) if c ≤ r

where c is the length of the machine translation and r is the reference length.
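A sketch of the penalty factor and the final BLEU combination, reusing the ngrams and modified_precision functions from the snippet above (again, names are illustrative, and real toolkits add smoothing):

```python
import math

def brevity_penalty(candidate_len, reference_lens):
    """BP = 1 if the candidate is longer than the reference length,
    otherwise exp(1 - r / c). Here r is taken as the reference length
    closest to the candidate length (ties go to the shorter reference)."""
    r = min(reference_lens, key=lambda ref_len: (abs(ref_len - candidate_len), ref_len))
    c = candidate_len
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """BLEU = BP * exp(sum_n w_n * log p_n) with uniform weights w_n = 1/N."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # log(0) is undefined; real toolkits apply smoothing here
        return 0.0
    bp = brevity_penalty(len(candidate), [len(ref) for ref in references])
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

candidate = "a cat is on the table".split()
references = ["there is a cat on the table".split()]
print(bleu(candidate, references, max_n=2))   # ≈ 0.66 = exp(1 - 7/6) * sqrt(1 * 3/5)
```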
3. ROUGE
The full name of the ROUGE metric is Recall-Oriented Understudy for Gisting Evaluation, and it is mainly based on Recall. Chin-Yew Lin proposed four ROUGE variants in his paper:
- ROUGE-N: computes recall over n-grams
- ROUGE-L: based on the longest common subsequence (LCS) between the machine translation and the reference translation
- ROUGE-W: an improvement of ROUGE-L that weights the longest common subsequence to favor consecutive matches
- ROUGE-S: also based on n-grams, but uses skip-grams, whose words do not need to occur consecutively
3.1 ROUGE-N
ROUGE-N computes n-gram recall. For a given n, the ROUGE-N score is:

ROUGE-N = Σ_{gram_n ∈ S} Count_match(gram_n) / Σ_{gram_n ∈ S} Count(gram_n)

The denominator counts the n-grams in the reference translation S, and the numerator counts the n-grams shared by the reference and the machine translation. For example:
C: a cat is on the table
S1: there is a cat on the table
The ROUGE-1 and ROUGE-2 scores for the above example are: ROUGE-1 = 6/7 (six of the seven reference unigrams appear in C) and ROUGE-2 = 3/6 = 0.5 (three of the six reference bigrams appear in C).
Chin-Yew Lin also gives a multi-reference version: with M reference translations S1, …, SM, the ROUGE-N score is computed between the machine translation and each reference, and the maximum is taken:

ROUGE-N_multi = max_j ROUGE-N(S_j, C)

The same approach can be used with ROUGE-L, ROUGE-W, and ROUGE-S.
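A minimal ROUGE-N recall sketch along the same lines (illustrative code, not the official ROUGE scripts):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    """ROUGE-N recall: overlapping n-gram count divided by the number of
    n-grams in the reference."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    overlap = sum(min(cnt, cand_counts[gram]) for gram, cnt in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

candidate = "a cat is on the table".split()
reference = "there is a cat on the table".split()
print(rouge_n(candidate, reference, 1))   # 6/7 ≈ 0.857
print(rouge_n(candidate, reference, 2))   # 3/6 = 0.5
```

For multiple references, the score above would simply be computed against each reference and the maximum kept.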
3.2 ROUGE-L
The L in ROUGE-L stands for the longest common subsequence (LCS). ROUGE-L is calculated as follows:

R_LCS = LCS(X, Y) / m
P_LCS = LCS(X, Y) / n
F_LCS = (1 + β²) · R_LCS · P_LCS / (R_LCS + β² · P_LCS)

where X is the reference translation of length m, Y is the machine translation of length n, R_LCS is the recall, P_LCS is the precision, and F_LCS is the ROUGE-L score. β is usually set to a large number, so F_LCS is dominated by R_LCS: as β grows, the contribution of P_LCS becomes negligible and F_LCS ≈ R_LCS.
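A sketch of ROUGE-L using a standard dynamic-programming LCS (the value of beta here is illustrative; it simply needs to be large for the score to track recall):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=8.0):
    """F_LCS = (1 + beta^2) * R * P / (R + beta^2 * P); a large beta makes
    the score close to the recall R_LCS."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference)
    p = lcs / len(candidate)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

candidate = "a cat is on the table".split()
reference = "there is a cat on the table".split()
print(rouge_l(candidate, reference))   # ≈ 0.716, close to R_LCS = 5/7 ≈ 0.714
```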
3.3 ROUGE-W
ROUGE-W is an improved version of ROUGE-L. Consider the following example, where X is a reference translation and Y1, Y2 are two machine translations (tokens written as abstract letters):

X: A B C D E F G
Y1: A B C D H I K
Y2: A H B K C I D

Y1 is clearly the better translation because its matches with X are consecutive, yet both Y1 and Y2 share an LCS of length 4 with X, so ROUGE-L gives them the same score: ROUGE-L(X, Y1) = ROUGE-L(X, Y2).
The author therefore proposes a weighted longest common subsequence (WLCS) method that gives higher scores to consecutive matches. For details, see the original paper, ROUGE: A Package for Automatic Evaluation of Summaries.
3.4 ROUGE-S
ROUGE-S also counts n-grams, but it uses skip-grams, whose words do not need to occur consecutively. For example, the skip-bigrams of "I have a cat" are (I, have), (I, a), (I, cat), (have, a), (have, cat), (a, cat).
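A simplified skip-bigram sketch (recall only, with unlimited skip distance; the official ROUGE-S variants can limit the skip distance and report an F-measure):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """All ordered token pairs, allowing any gap between them (skip-bigrams)."""
    return Counter(combinations(tokens, 2))

def rouge_s(candidate, reference):
    """Skip-bigram recall: overlapping skip-bigrams / skip-bigrams in the reference."""
    ref = skip_bigrams(reference)
    cand = skip_bigrams(candidate)
    overlap = sum(min(cnt, cand[pair]) for pair, cnt in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(sorted(skip_bigrams("I have a cat".split())))
# [('I', 'a'), ('I', 'cat'), ('I', 'have'), ('a', 'cat'), ('have', 'a'), ('have', 'cat')]
print(rouge_s("a cat is on the table".split(),
              "there is a cat on the table".split()))   # skip-bigram recall for the running example
```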
4. References
- Bleu: a method for automatic evaluation of machine translation
- ROUGE: A Package for Automatic Evaluation of Summaries