Comparison of ROVER and MEMT

Shyam Jayaraman
Advisor: Bob Frederking
1 Introduction
A variety of different paradigms for machine translation (MT) have been
developed over the years, ranging from statistical systems that learn mappings between
words and phrases in the source language and their corresponding translations in the
target language, to Interlingua-based systems that perform deep semantic analysis. Each
approach and system has different advantages and disadvantages. While statistical
systems provide broad coverage with little manpower, the quality of these corpus-based
systems rarely reaches that of knowledge-based systems. With such a wide range of
approaches to machine translation, it would be beneficial to have an effective framework
for combining these systems into an MT system that carries many of the advantages of
the individual systems and suffers from few of their disadvantages. Attempts at
combining the output of different systems have proved useful in other areas of language
technologies, such as the ROVER approach for speech recognition (Fiscus 1997).
Several approaches to multi-engine machine translation have been proposed.
Since the Pangloss system in 1994, many attempts have been made to combine lattices
from many different MT systems (Frederking & Nirenburg 1994; Frederking et al. 1997;
Tidhar & Küssner 2000; Lavie, Probst et al. 2004). In 2001, Bangalore et al. used string
alignments between the different translations to train a finite state machine to produce a
consensus translation. The alignment algorithm described in that work allows insertions,
deletions, and substitutions. A more recent approach used a more sophisticated alignment
algorithm combined with a language model to generate hypotheses and rank them
(Jayaraman & Lavie 2005).
Though many attempts have been made to improve the quality of MT by
combining different MT systems, the improvements do not seem as substantial as the
improvements of ROVER on speech data. Though the early systems decoded through
lattices, the last two systems attempted approaches very similar to ROVER itself. Yet
these systems do not always improve the output of the MT systems.
This paper investigates why this is the case. We start by trying to obtain
improvements on speech data with an MEMT approach similar to those that ROVER
achieves. Then we analyze the differences between the MEMT output and the ROVER output.
The rest of this paper is laid out in six sections. First, the MEMT system and ROVER
are described. Then we describe the experimental setup. Afterwards, we show the results
of these experiments and sample output. We then provide conclusions as well as the
future direction of this work.
2 MEMT System
The Multi-Engine Machine Translation (MEMT) system that is used for this
project is a new approach that uses explicit word matching to restrict the search space of
a decoder (Jayaraman & Lavie 2005). MEMT operates on the single “top-best”
translation output produced by each of several MT systems operating on a common input
sentence. MEMT first aligns the words of the different translation systems using a word
alignment matcher. Then, using the alignments provided by the matcher, the system
generates a set of synthetic sentence hypothesis translations, which are then assigned a
score based on the alignment information, the confidence of the individual systems, and a
language model. The hypothesis translation with the best score is selected as the final
output of the MEMT combination.
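As a rough illustration of this selection step, the sketch below combines alignment evidence, average engine confidence, and a language model score in a simple log-linear fashion. The function names, features, and weights here are illustrative assumptions, not the actual MEMT implementation.

    import math

    def score_hypothesis(alignment_score, engine_confidences, lm_logprob,
                         w_align=1.0, w_conf=1.0, w_lm=1.0):
        """Combine alignment evidence, average engine confidence, and a
        language model log-probability into one hypothesis score."""
        avg_conf = sum(engine_confidences) / len(engine_confidences)
        return (w_align * alignment_score
                + w_conf * math.log(avg_conf)
                + w_lm * lm_logprob)

    def select_best(scored_hypotheses):
        """scored_hypotheses: list of (hypothesis_words, score) pairs.
        The hypothesis with the best combined score is the final output."""
        return max(scored_hypotheses, key=lambda pair: pair[1])[0]
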
While generating the synthetic hypothesis translations, the MEMT system assumes
the original translations are word synchronous. This means that each word in one
original translation corresponds to a word in another translation. Therefore, when the
MEMT system chooses to put one word in a hypothesis, it tries to mark one word in each
of the other translations as used as well. If the matcher does not find explicit matches
between the selected word and a word in one of the original translations, the MEMT
system tries to create an artificial match which, ideally, is a semantically equivalent
alternative to the selected word. The artificial match is restricted by a matching window,
which specifies how far the system should search ahead to find an appropriate match, and
a part-of-speech dictionary that restricts these artificial matches to be of the same part of
speech. Finally, the MEMT system has a notion of a lingering word horizon, which
controls how long an unused word remains a viable alternative: once several words
after it have been used, the word is discarded. A more detailed description of the
algorithm is provided in the paper to be published at EAMT 2005 (Jayaraman & Lavie 2005).
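To make the matching window and part-of-speech restriction concrete, here is a minimal sketch of how such an artificial-match search could work, assuming a list-of-words representation and a dictionary mapping words to part-of-speech tags; the real system's data structures and tie-breaking are not described here.

    def find_artificial_match(selected_word, other_words, position, window, pos_dict):
        """Look ahead up to `window` positions in another engine's output for
        a word whose part of speech matches that of `selected_word`; return
        its index, or None if no plausible match exists within the window."""
        target_pos = pos_dict.get(selected_word)
        end = min(position + window, len(other_words))
        for i in range(position, end):
            # Words missing from the dictionary are treated as unmatchable here.
            if target_pos is not None and pos_dict.get(other_words[i]) == target_pos:
                return i  # mark other_words[i] as used alongside selected_word
        return None
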
3 ROVER
ROVER takes in speech lattices from different automatic speech recognizers
(ASR) and selects the best output. Each arc in the speech lattice has a start time,
duration, an output word, and a confidence score. ROVER has many different options
for choosing which of the many arcs to use. The first method chooses the word that
the largest number of original ASR systems propose. For instance, if three
systems suggest “Washington” and one system suggests “washing ton”, this method
would choose “Washington” as the output. Another method chooses the word with the
highest average confidence score over the arcs that propose it. For instance, in the
previous example, if two
systems suggest “Washington” with 0.75 confidence, one system suggests “Washington”
with 1.0 confidence, and one system suggests “washing ton” with 0.90 confidence, this
method would choose “washing ton”. The final method selects the arc with the single
highest confidence score. In the previous example, the chosen arc would be
“Washington”, since one of the original ASR systems has extremely high confidence in
it. In the experiments reported below, ROVER used the average confidence method.
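A minimal sketch of these three voting options, using the example above, might look as follows; it treats each alternative as a single token and ignores the time alignment that real lattices carry.

    from collections import defaultdict

    def rover_vote(arcs, method="avg_confidence"):
        """Choose among competing arcs, given as (word, confidence) pairs,
        at one slot of the composite lattice (simplified from Fiscus 1997)."""
        groups = defaultdict(list)
        for word, conf in arcs:
            groups[word].append(conf)
        if method == "frequency":        # word proposed by the most systems
            return max(groups, key=lambda w: len(groups[w]))
        if method == "avg_confidence":   # highest mean confidence
            return max(groups, key=lambda w: sum(groups[w]) / len(groups[w]))
        if method == "max_confidence":   # single highest-confidence arc
            return max(groups, key=lambda w: max(groups[w]))
        raise ValueError(method)

    arcs = [("Washington", 0.75), ("Washington", 0.75),
            ("Washington", 1.0), ("washing ton", 0.90)]
    print(rover_vote(arcs, "frequency"))       # Washington (3 votes to 1)
    print(rover_vote(arcs, "avg_confidence"))  # washing ton (0.90 > ~0.83)
    print(rover_vote(arcs, "max_confidence"))  # Washington (confidence 1.0)
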
4 Experiments
Two speech datasets were chosen for this experiment. The first was a subset of
travel reservation conversations. This subset contained only sentences with more than
five words in the output, since the MEMT system might not be reliable on short
fragments. There were a total of 200 sentences in this dataset. The second dataset was
a set of naval fleet management dialogues. The sentences were fairly long, averaging about
ten words a sentence. There were 400 sentences in that set.
ROVER and MEMT were run over all the original ASR outputs. For the MEMT
system, a matching window of three, a lingering word horizon of five, and a small
part-of-speech dictionary were used. The output of the original ASR systems, ROVER, and
MEMT were compared using the word error rate (WER), the word accuracy of the
output, and the METEOR metric. The WER is calculated by dividing the number of
insertions, deletions, and substitutions needed to “fix” the output by the number of
words in the reference transcription; this number can be greater than 100%. The
accuracy is the number of words right divided by the number of words in the reference
transcription. METEOR is an MT evaluation metric that is biased heavily toward recall
(Lavie, Sagae, et al. 2004).
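For concreteness, a minimal word-level WER computation via the standard dynamic-programming edit distance might look like this:

    def wer(reference, hypothesis):
        """Word error rate: (substitutions + deletions + insertions) divided
        by the number of reference words; the result can exceed 100%."""
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat", "the cat sat down"))  # one insertion / 3 words = 0.33
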
After getting back the first set of results, MEMT was run again with more
aggressive settings that intuitively seemed more tuned to speech data. Since there is not
much reordering of words in speech output, the matching window and lingering word
horizon were set tightly to prevent the artificial alignment from reordering the words.
Also, when speech recognizers make transcription errors, the errors are not likely to
produce words with a similar part of speech; such errors tend to yield words that sound
the same, which may or may not share a part of speech. Therefore, the MEMT system was
run again with a lingering word horizon of zero, a matching window of one, and no
part-of-speech dictionary.
Finally, to see if the word reordering was the main cause of the lower MEMT
score, the output of MEMT was reordered so the ordering matched the ordering of the
reference transcription. Extraneous words and incorrect words were still left in the
output. The hope was that the errors removed by this “cleanup” pass would be the same
kind of errors that ROVER would be apt to make.
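The exact reordering procedure is not specified in this paper; one plausible sketch moves output words that also occur in the reference into reference order, keeping extraneous and incorrect words in the output (here, appended at the end, a simplifying assumption):

    def reorder_to_reference(output_words, reference_words):
        """Undo reordering only: words found in the reference are placed in
        reference order; words not in the reference are kept, not deleted."""
        ref_slots = {}
        for i, w in enumerate(reference_words):
            ref_slots.setdefault(w, []).append(i)
        matched, extraneous = [], []
        for w in output_words:
            if ref_slots.get(w):
                matched.append((ref_slots[w].pop(0), w))  # earliest unused slot
            else:
                extraneous.append(w)
        return [w for _, w in sorted(matched)] + extraneous
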
5 Results
System                     WER     Word Accuracy   METEOR
mfcc                       10.3%   92.2%           0.9071
lda                         8.1%   93.5%           0.9184
pca                         9.1%   92.7%           0.9103
plp                        11.5%   91.1%           0.8987
ROVER                       8.4%   93.6%           0.9188
MEMT                       15.0%   92.7%           0.9147
MEMT v2                    11.0%   92.2%           -
MEMT with fixed ordering   10.2%   92.4%
Table 1: Scores on the naval fleet management data
In the naval fleet management dataset, all the original ASR systems do fairly well.
The WERs are less than 15% and METEOR gives them all a fairly high score. The best
original ASR system is lda. The worst system, plp, does not seem too far behind.
ROVER does not improve on the WER of the best original system, but does slightly
improve the accuracy and METEOR scores. The baseline MEMT system has a WER higher than any
of the original ASR outputs. The accuracy of the MEMT output was in the middle of the
pack. The METEOR metric ranks the original MEMT system fairly high. The more
aggressively tuned parameters reduce the WER of the MEMT output but also decrease the
accuracy slightly. The WER is still fairly high, but the MEMT system no longer
does worse than all of the original ASR systems. When the output of the aggressive
MEMT system is reordered the “right” way, the WER is reduced even more, but the
result is still worse than several of the original ASR outputs, as well as ROVER.
System                     WER     Word Accuracy   METEOR
F00                        22.6%   81.1%           0.4601
F01                        22.3%   81.1%           0.4606
M00                        36.1%   68.2%           0.3869
M01                        37.0%   66.7%           0.3811
ROVER                      28.7%   83.3%           0.4637
MEMT                       36.6%   74.9%           0.4255
MEMT v2                    40.1%   74.0%           0.4160
MEMT with fixed ordering   34.3%   75.5%           0.4160
Table 2: Scores on the travel reservation data
In the travel reservation dataset, all the original ASR systems do poorly. The
WERs are greater than 20%. There is a large difference between the best system and the
worst system. The METEOR score and the word accuracy agree on the rankings of the
system. The WER rankings are different. ROVER does not improve the WER of the
ASR output. The MEMT system has a higher WER compared to the ROVER. The more
aggressively tuned parameters increase the WER of the system, which does not seem
intuitive. Re-ordering the output of the aggressively tuned MEMT system decreases the
error rate as compared to the original MEMT system, but still does not come close to the
WER of the ROVER output.
6 Conclusion
ROVER does not improve the output of the ASR systems that are used in these
experiments. The original MEMT system performs worse on these datasets than
ROVER. A more aggressively tuned MEMT system performs slightly better than the
standard system, but is still worse than the ROVER output. One of the reasons for this
seems to be the reordering that the MEMT system allows, which does not occur in speech
output. Undoing the reordering done by MEMT improves the WER of the output of the
MEMT system, but still does not bring MEMT to the level of ROVER.
7 Future Work
In order to have any conclusive results, a dataset on which ROVER improves over the
original systems is needed.
On this dataset, if the MEMT system still does worse than ROVER, we can identify
where ROVER is doing something right and MEMT is doing something wrong. On this
dataset, we can perturb the input so that it more closely matches MT output and run both
ROVER and MEMT on the new data. By doing so, we can see where the performance of
ROVER and MEMT degrades, which allows us to see which differences between speech
and MT data, if any, are responsible for making the combination of MT output harder
than the combination of speech output.
8 References
Bangalore, S., G. Bordel, and G. Riccardi (2001). Computing Consensus Translation from Multiple Machine Translation Systems. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU-2001), Italy.
Fiscus, J. G. (1997). A Post-processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER). In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU-1997).
Frederking, R. and S. Nirenburg (1994). Three Heads are Better than One. In Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), Stuttgart, Germany.
Hogan, C. and R. E. Frederking (1998). An Evaluation of the Multi-engine MT Architecture. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas, pp. 113-123. Springer-Verlag, Berlin.
Jayaraman, S. and A. Lavie (2005). Multi-Engine Machine Translation Guided by Explicit Word Matching. To appear in Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT-2005), Budapest, Hungary, May 2005.
Lavie, A., K. Probst, E. Peterson, S. Vogel, L. Levin, A. Font-Llitjos and J. Carbonell (2004). A Trainable Transfer-based Machine Translation Approach for Languages with Limited Resources. In Proceedings of the Workshop of the European Association for Machine Translation (EAMT-2004), Valletta, Malta.
Lavie, A., K. Sagae and S. Jayaraman (2004). The Significance of Recall in Automatic Metrics for MT Evaluation. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004), Washington, DC, September 2004.
Tidhar, D. and U. Küssner (2000). Learning to Select a Good Translation. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Saarbrücken, Germany.