METEOR: Metric for Evaluation of Translation with Explicit Ordering
An Automatic Metric for MT Evaluation with Improved Correlations with Human Judgments

Alon Lavie and Satanjeev Banerjee
Language Technologies Institute, Carnegie Mellon University
With help from: Kenji Sagae, Shyamsundar Jayaraman
June 29, 2005 – ACL-05 Workshop on MT and Summarization Evaluation

Similarity-based MT Evaluation Metrics
• Assess the "quality" of an MT system by comparing its output with human-produced "reference" translations
• Premise: the more similar (in meaning) the translation is to the reference, the better
• Goal: an algorithm that is capable of accurately approximating this similarity
• Wide range of metrics, mostly focusing on word-level correspondences:
  – Edit-distance metrics: Levenshtein, WER, PIWER, …
  – Ngram-based metrics: Precision, Recall, F1-measure, BLEU, NIST, GTM, …
• Main issue: exact word matching is a very crude estimate of sentence-level similarity in meaning

Desirable Automatic Metric
• High levels of correlation with quantified human notions of translation quality
• Sensitive to small differences in MT quality between systems and versions of systems
• Consistent – the same MT system on similar texts should produce similar scores
• Reliable – MT systems that score similarly will perform similarly
• General – applicable to a wide range of domains and scenarios
• Fast and lightweight – easy to run

Some Weaknesses in BLEU
• BLEU is a Precision-based metric: it matches word ngrams of the MT translation against multiple reference translations simultaneously
  – Is this better than matching with each reference translation separately and selecting the best match?
• BLEU compensates for the lack of Recall by factoring in a "Brevity Penalty" (BP)
  – Is the BP adequate in compensating for lack of Recall?
• BLEU's ngram matching requires exact word matches
  – Can stemming and synonyms improve the similarity measure and improve correlation with human scores?
• All matched words weigh equally in BLEU
  – Can a scheme for weighing word contributions improve correlation with human scores?
• BLEU's higher-order ngrams account for fluency and grammaticality; ngram scores are geometrically averaged
  – Geometric ngram averaging is volatile to "zero" scores. Can we account for fluency/grammaticality via other means?

Unigram-based Metrics
• Unigram Precision: fraction of words in the MT output that appear in the reference
• Unigram Recall: fraction of words in the reference translation that appear in the MT output
• F1 = P*R / (0.5*(P+R))
• Fmean = P*R / (0.9*P + 0.1*R)  (a code sketch follows below)
• Generalized unigram matches:
  – Exact word matches, stems, synonyms
• Match with each reference separately and select the best match for each sentence

The METEOR Metric
• Main new ideas:
  – Reintroduce Recall and combine it with Precision as score components
  – Look only at unigram Precision and Recall
  – Align MT output with each reference individually and take the score of the best pairing
  – Matching takes into account word inflection variations (via stemming) and synonyms (via WordNet synsets)
  – Address fluency via a direct penalty: how fragmented is the matching of the MT output with the reference?
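To make the unigram definitions above concrete, here is a minimal sketch in Python (not the authors' released code). The example sentences are made up, and the simple count-clipping stands in for the full alignment matcher described on the next slides.

```python
# Illustrative sketch (not the METEOR implementation): unigram Precision,
# Recall, F1, and the recall-weighted Fmean = P*R / (0.9*P + 0.1*R).
from collections import Counter

def unigram_scores(mt_tokens, ref_tokens):
    """Score one MT output against one reference using exact unigram matches."""
    # Clip counts so each reference word can be matched at most once.
    mt_counts, ref_counts = Counter(mt_tokens), Counter(ref_tokens)
    matches = sum(min(c, ref_counts[w]) for w, c in mt_counts.items())

    p = matches / len(mt_tokens) if mt_tokens else 0.0    # unigram Precision
    r = matches / len(ref_tokens) if ref_tokens else 0.0  # unigram Recall
    f1 = p * r / (0.5 * (p + r)) if (p + r) else 0.0      # harmonic mean
    fmean = p * r / (0.9 * p + 0.1 * r) if (p + r) else 0.0  # weights Recall over Precision
    return p, r, f1, fmean

# Hypothetical example sentences:
mt = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(unigram_scores(mt, ref))
```

In practice the scores are computed against each reference separately and the best-scoring pairing is kept for the sentence.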
The Alignment Matcher
• Find the best word-to-word alignment match between two strings of words
  – Each word in a string can match at most one word in the other string
  – Matches can be based on generalized criteria: word identity, stem identity, synonymy, …
  – Find the alignment of highest cardinality with the minimal number of crossing branches
• Optimal search is NP-complete
  – Clever search with pruning is very fast and has near-optimal results
• Greedy three-stage matching: exact, stem, synonyms

The Full METEOR Metric
• Matcher explicitly aligns matched words between MT and reference
• Matcher returns fragment count (frag) – used to calculate average fragmentation
  – (frag - 1) / (length - 1), where length is the number of matched unigrams
• METEOR score calculated as a discounted Fmean score
  – Discounting factor: DF = 0.5 * (frag**3)
  – Final score: Fmean * (1 - DF)
• Scores can be calculated at sentence level
• Aggregate score calculated over the entire test set (similar to BLEU)

METEOR Metric
• Effect of Discounting Factor
  [Chart: effect of the fragmentation-based discounting factor, plotted over fragmentation values 0.1–1.0]

The METEOR Metric
• Example:
  – Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
  – MT output: "in two weeks Iraq's weapons will give army"
• Matching:
  – Ref: Iraqi weapons army two weeks
  – MT: two weeks Iraq's weapons army
• P = 5/8 = 0.625
• R = 5/14 = 0.357
• Fmean = 10*P*R / (9*P + R) = 0.3731
• Fragmentation: 3 frags of 5 words = (3-1)/(5-1) = 0.50
• Discounting factor: DF = 0.5 * (frag**3) = 0.0625
• Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
  (a code sketch reproducing these numbers follows)
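As a sanity check on the arithmetic above, here is a minimal Python sketch (illustrative only, not the released METEOR package) that counts matched chunks from a word alignment and recomputes the example's score. The function names and the 0-based alignment indices are assumptions made for illustration.

```python
# Illustrative sketch: sentence-level METEOR score from match statistics,
# following the worked example above.

def count_chunks(alignment):
    """alignment: list of (mt_index, ref_index) pairs, one per matched word.
    A chunk ("frag") is a maximal run of matches contiguous in both strings."""
    alignment = sorted(alignment)
    chunks, prev = 0, None
    for mt_i, ref_i in alignment:
        if prev is None or mt_i != prev[0] + 1 or ref_i != prev[1] + 1:
            chunks += 1
        prev = (mt_i, ref_i)
    return chunks

def meteor_score(matches, chunks, mt_len, ref_len):
    """Assumes matches > 1; matches = matched unigrams, chunks = fragments."""
    p = matches / mt_len                  # unigram precision
    r = matches / ref_len                 # unigram recall
    fmean = 10 * p * r / (9 * p + r)      # recall-weighted Fmean
    frag = (chunks - 1) / (matches - 1)   # average fragmentation
    df = 0.5 * frag ** 3                  # discounting factor
    return fmean * (1 - df)

# Worked example: matches two, weeks, Iraq's, weapons, army
# (0-based positions in the MT output and the reference).
align = [(1, 12), (2, 13), (3, 1), (4, 2), (7, 10)]
print(count_chunks(align))                                            # 3 chunks
print(round(meteor_score(matches=5, chunks=3, mt_len=8, ref_len=14), 4))  # ~0.3498
```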
Evaluating METEOR
• How do we know if a metric is better?
  – Better correlation with human judgments of MT output
  – Reduced score variability on MT sentence outputs that are ranked equivalent by humans
  – Higher and less variable scores when scoring human translations (references) against other human (reference) translations

Correlation with Human Judgments
• Human judgment scores for adequacy and fluency, each in [1-5] (or sum them together)
• Pearson or Spearman (rank) correlations
• Correlation of metric scores with human scores at the system level
  – Can rank systems
  – Even coarse metrics can have high correlations
• Correlation of metric scores with human scores at the sentence level
  – Evaluates score correlations at a fine-grained level
  – Very large number of data points, multiple systems
  – Pearson correlation (a code sketch appears after the results tables below)
  – Look at metric score variability for MT sentences scored as equally good by humans

Evaluation Setup
• Data: LDC-released common data set (DARPA/TIDES 2003 Chinese-to-English and Arabic-to-English MT evaluation data)
• Chinese data:
  – 920 sentences, 4 reference translations
  – 7 systems
• Arabic data:
  – 664 sentences, 4 reference translations
  – 6 systems
• Metrics compared: BLEU, P, R, F1, Fmean, METEOR (with several features)

Evaluation Results: System-level Correlations
            Chinese data   Arabic data   Average
BLEU        0.828          0.930         0.879
Mod-BLEU    0.821          0.926         0.874
Precision   0.788          0.906         0.847
Recall      0.878          0.954         0.916
F1          0.881          0.971         0.926
Fmean       0.881          0.964         0.922
METEOR      0.896          0.971         0.934

Evaluation Results: Sentence-level Correlations
            Chinese data   Arabic data   Average
BLEU        0.194          0.228         0.211
Mod-BLEU    0.285          0.307         0.296
Precision   0.286          0.288         0.287
Recall      0.320          0.335         0.328
Fmean       0.327          0.340         0.334
METEOR      0.331          0.347         0.339

Adequacy, Fluency and Combined: Sentence-level Correlations (Arabic Data)
            Adequacy   Fluency   Combined
BLEU        0.239      0.171     0.228
Mod-BLEU    0.315      0.238     0.307
Precision   0.306      0.210     0.288
Recall      0.362      0.236     0.335
Fmean       0.367      0.240     0.340
METEOR      0.370      0.252     0.347

METEOR Mapping Modules: Sentence-level Correlations
                    Chinese data   Arabic data   Average
Exact               0.293          0.312         0.303
Exact+Pstem         0.318          0.329         0.324
Exact+WNstem        0.312          0.330         0.321
Exact+Pstem+WNsyn   0.331          0.347         0.339

Normalizing Human Scores
• Human scores are noisy:
  – Medium levels of intercoder agreement, judge biases
• MITRE group performed score normalization
  – Normalize judge median score and distributions
• Significant effect on sentence-level correlation between metrics and human scores
                           Chinese data   Arabic data   Average
Raw Human Scores           0.331          0.347         0.339
Normalized Human Scores    0.365          0.403         0.384
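The sentence-level figures above are Pearson correlations between per-sentence metric scores and human scores. Below is a minimal Python sketch of that computation, assuming the scores are simply paired lists; the example values are hypothetical, and the MITRE normalization procedure itself is not reproduced here.

```python
# Illustrative sketch: Pearson correlation between per-sentence metric scores
# and per-sentence human scores (e.g., adequacy + fluency).
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical example: metric scores vs. total human scores for 5 sentences.
metric_scores = [0.35, 0.42, 0.18, 0.55, 0.30]
human_scores = [6, 7, 3, 9, 5]
print(round(pearson(metric_scores, human_scores), 3))
```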
METEOR vs. BLEU: Sentence-level Scores (CMU SMT System, TIDES 2003 Data)
[Scatter plots of automatic score vs. total human score per sentence: BLEU (trend line y = 0.03x - 0.0152, R2 = 0.0608, R = 0.2466) and METEOR (trend line y = 0.0425x + 0.2788, R2 = 0.1705, R = 0.4129)]

METEOR vs. BLEU: Histograms of Scores of Reference Translations (2003 Data)
[Histograms of BLEU and METEOR scores for each reference translation scored against the other references: METEOR scores are higher and less variable (Mean = 0.6504, STD = 0.1310) than BLEU scores (Mean = 0.3727, STD = 0.2138)]

Future Work
• Optimize the fragmentation penalty function based on training data
• Extend the Matcher to capture word correspondences with semantic relatedness (synonyms and beyond)
• Evaluate the importance and effect of the number of reference translations
  – Need for synthetic references
• Schemes for weighing the matched words
  – Based on their translation importance
  – Based on their similarity level
• Consider other schemes for capturing fluency/grammaticality/coherence of MT output

Using METEOR
• METEOR software package freely available and downloadable on the web: http://www.cs.cmu.edu/~alavie/METEOR/
• Required files and formats are identical to BLEU: if you know how to run BLEU, you know how to run METEOR!
• We welcome comments and bug reports…

Conclusions
• Recall is more important than Precision
• Importance of focusing on sentence-level correlations
• Sentence-level correlations are still rather low (and noisy), but these are significant steps in the right direction
  – Generalizing matching with stemming and synonyms gives a consistent improvement in correlations with human judgments
• Human judgment normalization is important and has a significant effect

Questions?
Evaluation Results: 2002 System-level Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.461                 ±0.058
BLEU-stemmed         0.528                 ±0.061
NIST                 0.603                 ±0.049
NIST-stemmed         0.740                 ±0.043
Precision            0.175                 ±0.052
Precision-stemmed    0.257                 ±0.065
Recall               0.615                 ±0.042
Recall-stemmed       0.757                 ±0.042
F1                   0.425                 ±0.047
F1-stemmed           0.564                 ±0.052
Fmean                0.585                 ±0.043
Fmean-stemmed        0.733                 ±0.044
GTM                  0.77
GTM-stemmed          0.68
METEOR               0.7191

Evaluation Results: 2003 System-level Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.817                 ±0.021
BLEU-stemmed         0.843                 ±0.018
NIST                 0.892                 ±0.013
NIST-stemmed         0.915                 ±0.010
Precision            0.683                 ±0.041
Precision-stemmed    0.752                 ±0.041
Recall               0.961                 ±0.011
Recall-stemmed       0.940                 ±0.014
F1                   0.909                 ±0.025
F1-stemmed           0.948                 ±0.014
Fmean                0.959                 ±0.012
Fmean-stemmed        0.952                 ±0.013
GTM                  0.79
GTM-stemmed          0.89
METEOR               0.964

Evaluation Results: 2002 Pairwise Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.498                 ±0.054
BLEU-stemmed         0.559                 ±0.058
NIST                 0.679                 ±0.042
NIST-stemmed         0.774                 ±0.041
Precision            0.298                 ±0.051
Precision-stemmed    0.325                 ±0.064
Recall               0.743                 ±0.032
Recall-stemmed       0.845                 ±0.029
F1                   0.549                 ±0.042
F1-stemmed           0.643                 ±0.046
Fmean                0.711                 ±0.033
Fmean-stemmed        0.818                 ±0.032
GTM                  0.71
GTM-stemmed          0.69
METEOR               0.8115

Evaluation Results: 2003 Pairwise Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.758                 ±0.027
BLEU-stemmed         0.793                 ±0.025
NIST                 0.886                 ±0.017
NIST-stemmed         0.924                 ±0.013
Precision            0.573                 ±0.053
Precision-stemmed    0.666                 ±0.058
Recall               0.954                 ±0.014
Recall-stemmed       0.923                 ±0.018
F1                   0.881                 ±0.024
F1-stemmed           0.950                 ±0.017
Fmean                0.954                 ±0.015
Fmean-stemmed        0.940                 ±0.017
GTM                  0.78
GTM-stemmed          0.86
METEOR               0.965

New Evaluation Results: 2002 System-level Correlations
Metric                    Pearson Coefficient
METEOR                    0.6016
METEOR+Pstem              0.7177
METEOR+WNstem             0.7275
METEOR+Pst+syn            0.8070
METEOR+WNst+syn           0.7966
METEOR+stop               0.7597
METEOR+Pst+stop           0.8638
METEOR+WNst+stop          0.8783
METEOR+Pst+syn+stop       0.9138
METEOR+WNst+syn+stop      0.9126
F1+Pstem+syn+stop         0.8872
Fmean+WNst+syn+stop       0.8843
Fmean+Pst+syn+stop        0.8799
F1+WNstem+syn+stop        0.8787

New Evaluation Results: 2003 System-level Correlations
Metric                    Pearson Coefficient
METEOR                    0.9175
METEOR+Pstem              0.9621
METEOR+WNstem             0.9705
METEOR+Pst+syn            0.9852
METEOR+WNst+syn           0.9821
METEOR+stop               0.9286
METEOR+Pst+stop           0.9682
METEOR+WNst+stop          0.9764
METEOR+Pst+syn+stop       0.9821
METEOR+WNst+syn+stop      0.9802
Fmean+WNstem              0.9924
Fmean+WNstem+stop         0.9914
Fmean+Pstem               0.9905
Fmean+WNstem+syn          0.9905

METEOR vs. Mod-BLEU: Sentence-level Scores (E09 System, TIDES 2003 Data)
[Scatter plots of automatic scores vs. human scores for system E09: Mod-BLEU (R = 0.362) and METEOR (R = 0.399)]