METEOR:
Metric for Evaluation of Translation with Explicit Ordering
An Automatic Metric for MT Evaluation
with Improved Correlations with Human Judgments
Alon Lavie and Satanjeev Banerjee
Language Technologies Institute
Carnegie Mellon University
With help from:
Kenji Sagae, Shyamsundar Jayaraman
Similarity-based MT Evaluation Metrics
• Assess the “quality” of an MT system by comparing its output with human-produced “reference” translations
• Premise: the more similar (in meaning) the translation is to the reference, the better
• Goal: an algorithm that is capable of accurately approximating the similarity
• Wide range of metrics, mostly focusing on word-level correspondences:
– Edit-distance metrics: Levenshtein, WER, PIWER, …
– Ngram-based metrics: Precision, Recall, F1-measure, BLEU, NIST, GTM, …
• Main issue: perfect word matching is a very crude estimate of sentence-level similarity in meaning
Desirable Automatic Metric
• High levels of correlation with quantified human notions of translation quality
• Sensitive to small differences in MT quality between systems and versions of systems
• Consistent – the same MT system on similar texts should produce similar scores
• Reliable – MT systems that score similarly will perform similarly
• General – applicable to a wide range of domains and scenarios
• Fast and lightweight – easy to run
Some Weaknesses in BLEU
• BLEU matches word ngrams of the MT translation with multiple reference translations simultaneously → a Precision-based metric
– Is this better than matching with each reference translation separately and selecting the best match?
• BLEU compensates for Recall by factoring in a “Brevity Penalty” (BP)
– Is the BP adequate in compensating for the lack of Recall?
• BLEU’s ngram matching requires exact word matches
– Can stemming and synonyms improve the similarity measure and improve correlation with human scores?
• All matched words weigh equally in BLEU
– Can a scheme for weighing word contributions improve correlation with human scores?
• BLEU’s higher-order ngrams account for fluency and grammaticality; ngrams are geometrically averaged
– Geometric ngram averaging is volatile to “zero” scores. Can we account for fluency/grammaticality via other means?
Unigram-based Metrics
• Unigram Precision: fraction of the words in the MT output that appear in the reference
• Unigram Recall: fraction of the words in the reference translation that appear in the MT output
• F1 = P·R / (0.5·(P + R))
• Fmean = P·R / (0.9·P + 0.1·R)
• Generalized unigram matches:
– Exact word matches, stems, synonyms
• Match with each reference separately and select the best match for each sentence (see the sketch below)
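To make these definitions concrete, here is a minimal Python sketch of the unigram statistics above. It assumes whitespace tokenization and exact, case-insensitive matching only, so the stem/synonym generalizations are left out; the helper names are illustrative, not part of the released METEOR code.

```python
from collections import Counter

def unigram_scores(mt, reference):
    """Unigram precision, recall, F1 and recall-weighted Fmean for one
    MT segment against one reference (exact, case-insensitive matches only)."""
    mt_counts = Counter(mt.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each MT word can match at most one reference word (clipped counts).
    matches = sum(min(c, ref_counts[w]) for w, c in mt_counts.items())
    p = matches / max(sum(mt_counts.values()), 1)
    r = matches / max(sum(ref_counts.values()), 1)
    if matches == 0:
        return p, r, 0.0, 0.0
    f1 = p * r / (0.5 * (p + r))            # = 2PR / (P + R)
    fmean = p * r / (0.9 * p + 0.1 * r)     # recall weighted 9:1 over precision
    return p, r, f1, fmean

def best_against_references(mt, references):
    """Score against each reference separately and keep the best Fmean."""
    return max((unigram_scores(mt, ref) for ref in references),
               key=lambda scores: scores[3])
```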
The METEOR Metric
• Main new ideas:
– Reintroduce Recall and combine it with Precision as score components
– Look only at unigram Precision and Recall
– Align MT output with each reference individually and take the score of the best pairing
– Matching takes into account word inflection variations (via stemming) and synonyms (via WordNet synsets)
– Address fluency via a direct penalty: how fragmented is the matching of the MT output with the reference?
The Alignment Matcher
• Find the best word-to-word alignment match between two strings of words
– Each word in a string can match at most one word in the other string
– Matches can be based on generalized criteria: word identity, stem identity, synonymy, …
– Find the alignment of highest cardinality with the minimal number of crossing branches
• Optimal search is NP-complete
– Clever search with pruning is very fast and gives near-optimal results
• Greedy three-stage matching: exact, stem, synonyms (see the sketch below)
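A simplified sketch of the greedy three-stage matching, not the released implementation: the `stem` and `synonyms` hooks are hypothetical placeholders (they could be backed by a Porter stemmer and WordNet synsets), and the search that minimizes crossing links among equal-size alignments is omitted.

```python
def align(mt_words, ref_words, stem=None, synonyms=None):
    """Greedy three-stage unigram aligner (exact, then stem, then synonym).

    Returns a list of (mt_index, ref_index) pairs; each word is used at most
    once.  The real matcher also prefers, among alignments of the same size,
    the one with the fewest crossing links; that search is omitted here.
    """
    stem = stem or (lambda w: w)            # e.g. plug in a Porter stemmer
    synonyms = synonyms or (lambda w: {w})  # e.g. plug in WordNet synsets

    stages = [
        lambda a, b: a == b,                                # exact match
        lambda a, b: stem(a) == stem(b),                    # stem match
        lambda a, b: b in synonyms(a) or a in synonyms(b),  # synonym match
    ]
    pairs, used_mt, used_ref = [], set(), set()
    for same in stages:
        for i, mw in enumerate(mt_words):
            if i in used_mt:
                continue
            for j, rw in enumerate(ref_words):
                if j not in used_ref and same(mw.lower(), rw.lower()):
                    pairs.append((i, j))
                    used_mt.add(i)
                    used_ref.add(j)
                    break
    return sorted(pairs)
```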
The Full METEOR Metric
• The matcher explicitly aligns matched words between the MT output and the reference
• The matcher returns a fragment count (frag), used to calculate average fragmentation:
– fragmentation = (frag − 1) / (length − 1), where length is the number of matched words
• The METEOR score is calculated as a discounted Fmean score
– Discounting factor: DF = 0.5 · fragmentation³
– Final score: Fmean · (1 − DF)   (see the scoring sketch below)
• Scores can be calculated at the sentence level
• An aggregate score is calculated over the entire test set (similar to BLEU)
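A minimal scoring sketch that follows the formulas on this slide; the released METEOR package may differ in implementation details.

```python
def meteor_score(precision, recall, num_frags, num_matches):
    """Discounted Fmean, following the formulas on this slide."""
    if num_matches == 0 or precision == 0 or recall == 0:
        return 0.0
    fmean = precision * recall / (0.9 * precision + 0.1 * recall)
    # Average fragmentation: (frag - 1) / (matched words - 1); it is 0 when
    # all matched words fall in one contiguous fragment.
    fragmentation = 0.0 if num_matches == 1 else (num_frags - 1) / (num_matches - 1)
    discount = 0.5 * fragmentation ** 3
    return fmean * (1 - discount)
```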
METEOR Metric
• Effect of the discounting factor:
[Chart: the multiplicative factor applied to Fmean (1 − DF) plotted against fragmentation values from 0 to 1; the factor decreases from 1.0 toward 0.5 as fragmentation grows.]
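The curve is easy to reproduce in a few lines, assuming DF = 0.5 · fragmentation³ from the previous slide:

```python
# The multiplicative factor applied to Fmean as fragmentation grows from 0 to 1.
for i in range(11):
    fragmentation = i / 10
    factor = 1 - 0.5 * fragmentation ** 3
    print(f"fragmentation={fragmentation:.1f}  factor={factor:.3f}")
```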
The METEOR Metric
• Example:
– Reference: “the Iraqi weapons are to be handed over to the army within two weeks”
– MT output: “in two weeks Iraq’s weapons will give army”
• Matching:  Ref: Iraqi weapons army two weeks
             MT:  two weeks Iraq’s weapons army
• P = 5/8 = 0.625   R = 5/14 = 0.357
• Fmean = 10·P·R / (9·P + R) = 0.3731
• Fragmentation: 3 frags of 5 matched words = (3 − 1) / (5 − 1) = 0.50
• Discounting factor: DF = 0.5 · (0.50)³ = 0.0625
• Final score: Fmean · (1 − DF) = 0.3731 · 0.9375 = 0.3498
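The arithmetic can be checked directly:

```python
P, R = 5 / 8, 5 / 14                     # 5 matches, 8 MT words, 14 reference words
fmean = P * R / (0.9 * P + 0.1 * R)      # 0.3731
fragmentation = (3 - 1) / (5 - 1)        # 3 fragments over 5 matched words = 0.50
score = fmean * (1 - 0.5 * fragmentation ** 3)
print(round(fmean, 4), round(score, 4))  # 0.3731 0.3498
```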
Evaluating METEOR
• How do we know if a metric is better?
– Better correlation with human judgments of MT output
– Reduced score variability on MT sentence outputs that are ranked equivalent by humans
– Higher and less variable scores when scoring human translations (references) against other human (reference) translations
Correlation with Human Judgments
• Human judgment scores for adequacy and fluency, each on a [1–5] scale (or sum them together)
• Pearson or Spearman (rank) correlations (see the sketch after this list)
• Correlation of metric scores with human scores at the system level
– Can rank systems
– Even coarse metrics can have high correlations
• Correlation of metric scores with human scores at the sentence level
– Evaluates score correlations at a fine-grained level
– Very large number of data points, multiple systems
– Pearson correlation
– Look at metric score variability for MT sentences scored as equally good by humans
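For reference, a minimal sketch of how such correlations can be computed with SciPy; this is illustrative and not part of the original evaluation scripts.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_with_humans(metric_scores, human_scores):
    """Pearson and Spearman (rank) correlation between automatic metric
    scores and human judgments.  The same call works at the system level
    (one average score per system) or at the sentence level (one score
    per translated segment)."""
    metric = np.asarray(metric_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    return pearsonr(metric, human)[0], spearmanr(metric, human)[0]
```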
Evaluation Setup
• Data: LDC-released common data set (DARPA/TIDES 2003 Chinese-to-English and Arabic-to-English MT evaluation data)
• Chinese data:
– 920 sentences, 4 reference translations
– 7 systems
• Arabic data:
– 664 sentences, 4 reference translations
– 6 systems
• Metrics compared: BLEU, P, R, F1, Fmean, METEOR (with several features)
Evaluation Results:
System-level Correlations
Metric       Chinese data   Arabic data   Average
BLEU             0.828          0.930      0.879
Mod-BLEU         0.821          0.926      0.874
Precision        0.788          0.906      0.847
Recall           0.878          0.954      0.916
F1               0.881          0.971      0.926
Fmean            0.881          0.964      0.922
METEOR           0.896          0.971      0.934
Evaluation Results:
Sentence-level Correlations
Metric       Chinese data   Arabic data   Average
BLEU             0.194          0.228      0.211
Mod-BLEU         0.285          0.307      0.296
Precision        0.286          0.288      0.287
Recall           0.320          0.335      0.328
Fmean            0.327          0.340      0.334
METEOR           0.331          0.347      0.339
Adequacy, Fluency and Combined:
Sentence-level Correlations
Arabic data
Metric       Adequacy   Fluency   Combined
BLEU            0.239     0.171      0.228
Mod-BLEU        0.315     0.238      0.307
Precision       0.306     0.210      0.288
Recall          0.362     0.236      0.335
Fmean           0.367     0.240      0.340
METEOR          0.370     0.252      0.347
METEOR Mapping Modules:
Sentence-level Correlations
Mapping modules       Chinese data   Arabic data   Average
Exact                     0.293          0.312      0.303
Exact+Pstem               0.318          0.329      0.324
Exact+WNstem              0.312          0.330      0.321
Exact+Pstem+WNsyn         0.331          0.347      0.339
Normalizing Human Scores
• Human scores are noisy:
– Medium levels of intercoder agreement, judge biases
• The MITRE group performed score normalization (an illustrative sketch follows the table below)
– Normalize judge median scores and distributions
• Significant effect on sentence-level correlation between metrics and human scores
                          Chinese data   Arabic data   Average
Raw human scores              0.331          0.347      0.339
Normalized human scores       0.365          0.403      0.384
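The exact MITRE procedure is not given here, so the sketch below is only a plausible stand-in for per-judge normalization, under the assumption that each judge's scores are shifted to the pooled median and their spread rescaled to the pooled interquartile range.

```python
import numpy as np

def _spread(x):
    """Interquartile range, falling back to 1.0 for degenerate score sets."""
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    return iqr if iqr > 0 else 1.0

def normalize_judges(scores_by_judge):
    """Shift every judge's scores to the pooled median and rescale their
    spread to the pooled IQR.  Illustrative only; not MITRE's actual recipe."""
    pooled = np.concatenate([np.asarray(v, dtype=float)
                             for v in scores_by_judge.values()])
    target_med, target_spread = np.median(pooled), _spread(pooled)
    out = {}
    for judge, scores in scores_by_judge.items():
        s = np.asarray(scores, dtype=float)
        out[judge] = (s - np.median(s)) / _spread(s) * target_spread + target_med
    return out
```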
METEOR vs. BLEU
Sentence-level Scores
(CMU SMT System, TIDES 2003 Data)
[Scatter plots of sentence-level automatic scores against total human scores (0–10): BLEU sentence scores (R = 0.2466; fitted line y = 0.03x − 0.0152, R² = 0.0608) vs. METEOR sentence scores (R = 0.4129; fitted line y = 0.0425x + 0.2788, R² = 0.1705).]
METEOR vs. BLEU
Histogram of Scores of Reference Translations
2003 Data
[Histograms of the scores assigned to each reference translation when scored against the other references: BLEU mean = 0.3727, STD = 0.2138; METEOR mean = 0.6504, STD = 0.1310. METEOR scores the human references higher and with less variance.]
Future Work
• Optimize the fragmentation penalty function based on training data
• Extend the matcher to capture word correspondences with semantic relatedness (synonyms and beyond)
• Evaluate the importance and effect of the number of reference translations → need for synthetic references
• Schemes for weighing the matched words
– Based on their translation importance
– Based on their similarity level
• Consider other schemes for capturing fluency/grammaticality/coherence of MT output
Using METEOR
• The METEOR software package is freely available and downloadable on the web:
http://www.cs.cmu.edu/~alavie/METEOR/
• Required files and formats are identical to BLEU → if you know how to run BLEU, you know how to run METEOR!
• We welcome comments and bug reports…
Conclusions
• Recall is more important than Precision
• Importance of focusing on sentence-level correlations
• Sentence-level correlations are still rather low (and noisy), but these are significant steps in the right direction
– Generalizing matching with stemming and synonyms gives a consistent improvement in correlations with human judgments
• Human judgment normalization is important and has a significant effect
Questions?
Evaluation Results:
2002 System-level Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                       0.461                 ±0.058
BLEU-stemmed               0.528                 ±0.061
NIST                       0.603                 ±0.049
NIST-stemmed               0.740                 ±0.043
Precision                  0.175                 ±0.052
Precision-stemmed          0.257                 ±0.065
Recall                     0.615                 ±0.042
Recall-stemmed             0.757                 ±0.042
F1                         0.425                 ±0.047
F1-stemmed                 0.564                 ±0.052
Fmean                      0.585                 ±0.043
Fmean-stemmed              0.733                 ±0.044
GTM                        0.77                  –
GTM-stemmed                0.68                  –
METEOR                     0.7191                –
Evaluation Results:
2003 System-level Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                       0.817                 ±0.021
BLEU-stemmed               0.843                 ±0.018
NIST                       0.892                 ±0.013
NIST-stemmed               0.915                 ±0.010
Precision                  0.683                 ±0.041
Precision-stemmed          0.752                 ±0.041
Recall                     0.961                 ±0.011
Recall-stemmed             0.940                 ±0.014
F1                         0.909                 ±0.025
F1-stemmed                 0.948                 ±0.014
Fmean                      0.959                 ±0.012
Fmean-stemmed              0.952                 ±0.013
GTM                        0.79                  –
GTM-stemmed                0.89                  –
METEOR                     0.964                 –
Evaluation Results:
2002 Pairwise Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                       0.498                 ±0.054
BLEU-stemmed               0.559                 ±0.058
NIST                       0.679                 ±0.042
NIST-stemmed               0.774                 ±0.041
Precision                  0.298                 ±0.051
Precision-stemmed          0.325                 ±0.064
Recall                     0.743                 ±0.032
Recall-stemmed             0.845                 ±0.029
F1                         0.549                 ±0.042
F1-stemmed                 0.643                 ±0.046
Fmean                      0.711                 ±0.033
Fmean-stemmed              0.818                 ±0.032
GTM                        0.71                  –
GTM-stemmed                0.69                  –
METEOR                     0.8115                –
Evaluation Results:
2003 Pairwise Correlations
Metric               Pearson Coefficient   Confidence Interval
BLEU                       0.758                 ±0.027
BLEU-stemmed               0.793                 ±0.025
NIST                       0.886                 ±0.017
NIST-stemmed               0.924                 ±0.013
Precision                  0.573                 ±0.053
Precision-stemmed          0.666                 ±0.058
Recall                     0.954                 ±0.014
Recall-stemmed             0.923                 ±0.018
F1                         0.881                 ±0.024
F1-stemmed                 0.950                 ±0.017
Fmean                      0.954                 ±0.015
Fmean-stemmed              0.940                 ±0.017
GTM                        0.78                  –
GTM-stemmed                0.86                  –
METEOR                     0.965                 –
New Evaluation Results:
2002 System-level Correlations
Metric                    Pearson Coefficient
METEOR                          0.6016
METEOR+Pstem                    0.7177
METEOR+WNstem                   0.7275
METEOR+Pst+syn                  0.8070
METEOR+WNst+syn                 0.7966
METEOR+stop                     0.7597
METEOR+Pst+stop                 0.8638
METEOR+WNst+stop                0.8783
METEOR+Pst+syn+stop             0.9138
METEOR+WNst+syn+stop            0.9126
F1+Pstem+syn+stop               0.8872
Fmean+WNst+syn+stop             0.8843
Fmean+Pst+syn+stop              0.8799
F1+WNstem+syn+stop              0.8787
New Evaluation Results:
2003 System-level Correlations
Metric                    Pearson Coefficient
METEOR                          0.9175
METEOR+Pstem                    0.9621
METEOR+WNstem                   0.9705
METEOR+Pst+syn                  0.9852
METEOR+WNst+syn                 0.9821
METEOR+stop                     0.9286
METEOR+Pst+stop                 0.9682
METEOR+WNst+stop                0.9764
METEOR+Pst+syn+stop             0.9821
METEOR+WNst+syn+stop            0.9802
Fmean+WNstem                    0.9924
Fmean+WNstem+stop               0.9914
Fmean+Pstem                     0.9905
Fmean+WNstem+syn                0.9905
METEOR vs. Mod-BLEU
Sentence-level Scores
(E09 System, TIDES 2003 Data)
[Scatter plots of sentence-level Mod-BLEU and METEOR scores against human scores for system E09; the two correlations are R = 0.362 and R = 0.399.]