Comparison of ROVER and MEMT

Shyam Jayaraman
Advisor: Bob Frederking

1 Introduction

A variety of different paradigms for machine translation (MT) have been developed over the years, ranging from statistical systems that learn mappings between words and phrases in the source language and their corresponding translations in the target language, to Interlingua-based systems that perform deep semantic analysis. Each approach and system has different advantages and disadvantages. While statistical systems provide broad coverage with little manpower, the quality of corpus-based systems rarely reaches that of knowledge-based systems. With such a wide range of approaches to machine translation, it would be beneficial to have an effective framework for combining these systems into an MT system that carries many of the advantages of the individual systems and suffers from few of their disadvantages. Attempts at combining the output of different systems have proved useful in other areas of language technologies, such as the ROVER approach for speech recognition (Fiscus 1997).

Several approaches to multi-engine machine translation have been proposed. Since the Pangloss system in 1994, many attempts have been made to combine lattices from many different MT systems (Frederking and Nirenburg 1994; Frederking et al. 1997; Tidhar and Küssner 2000; Lavie, Probst et al. 2004). Bangalore et al. (2001) used string alignments between the different translations to train a finite state machine to produce a consensus translation. The alignment algorithm described in that work allows insertions, deletions, and substitutions. A more recent approach used a more sophisticated alignment algorithm combined with a language model to generate hypotheses and rank them (Jayaraman and Lavie 2005).

Though many attempts have been made to improve the quality of MT by combining different MT systems, the improvements do not seem as remarkable as those ROVER achieves on speech data. Though the early systems decoded through lattices, the last two systems attempted approaches very similar to ROVER itself. Yet these systems do not always improve on the output of the individual MT systems. This paper investigates why this is the case. We start by trying to obtain improvements on speech data with an MEMT approach similar to those achieved by ROVER. We then analyze the differences between the MEMT output and the ROVER output.

The rest of this paper is laid out in six sections. First, the MEMT system and ROVER are described. Then we describe the experimental setup. Afterwards we show the results of these experiments and sample output. We then provide conclusions as well as the future direction of this work.

2 MEMT System

The Multi-Engine Machine Translation (MEMT) system used for this project is a new approach that uses explicit word matching to restrict the search space of a decoder (Jayaraman and Lavie 2005). MEMT operates on the single "top-best" translation output produced by each of several MT systems operating on a common input sentence. MEMT first aligns the words of the different translation systems using a word alignment matcher. Then, using the alignments provided by the matcher, the system generates a set of synthetic sentence hypothesis translations, which are assigned a score based on the alignment information, the confidence of the individual systems, and a language model. The hypothesis translation with the best score is selected as the final output of the MEMT combination.
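As a rough illustration of this scoring-and-selection step, the Python sketch below combines the three ingredients named above. The function names, the linear interpolation, and the weights are hypothetical placeholders, not the actual MEMT implementation, which is described in Jayaraman and Lavie (2005).

def score_hypothesis(words, alignment_support, engine_confidences, lm_logprob,
                     weights=(1.0, 1.0, 1.0)):
    # alignment_support: how strongly the matcher's alignments support the
    # words chosen for this hypothesis (e.g., fraction of matched words).
    # engine_confidences: confidence of each engine that contributed words.
    # lm_logprob: language model log-probability, length-normalized so
    # shorter hypotheses are not trivially favored.
    w_align, w_conf, w_lm = weights
    avg_conf = sum(engine_confidences) / len(engine_confidences)
    return (w_align * alignment_support
            + w_conf * avg_conf
            + w_lm * lm_logprob / max(len(words), 1))

def select_best(candidates):
    # candidates: (words, alignment_support, engine_confidences, lm_logprob)
    # tuples, one per synthetic hypothesis; returns the best word sequence.
    return max(candidates, key=lambda c: score_hypothesis(*c))[0]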
While generating the synthetic hypothesis translations, the MEMT system assumes the original translations are word synchronous, meaning that each word in one original translation corresponds to a word in each other translation. Therefore, when the MEMT system chooses to put one word into a hypothesis, it tries to mark one word in each of the other translations as used as well. If the matcher does not find an explicit match between the selected word and a word in one of the original translations, the MEMT system tries to create an artificial match which, ideally, is a semantically equivalent alternative to the selected word. The artificial match is restricted by a matching window, which specifies how far ahead the system should search to find an appropriate match, and by a part-of-speech dictionary that restricts artificial matches to words of the same part of speech. Finally, the MEMT system has a notion of a lingering word horizon, which discards words that would otherwise still be considered viable alternatives even though several words after them have been used. A more detailed description of the algorithm is provided in the paper to be published at EAMT 2005 (Jayaraman and Lavie 2005).

3 ROVER

ROVER takes in speech lattices from different automatic speech recognizers (ASR) and selects the best output. Each arc in a speech lattice has a start time, a duration, an output word, and a confidence score. ROVER has several options for choosing which of the many arcs to use. The first method chooses the word that the largest number of original ASR systems agree on. For instance, if three systems suggest "Washington" and one system suggests "washing ton", this method chooses "Washington" as the output. Another method compares the average confidence scores assigned to the arcs. In the previous example, if two systems suggest "Washington" with 0.75 confidence, one system suggests "Washington" with 1.0 confidence, and one system suggests "washing ton" with 0.90 confidence, this method chooses "washing ton", since 0.90 is greater than the 0.83 average for "Washington". The final method selects the arc with the highest individual score. In the previous example, the chosen arc would be "Washington", since one of the original ASR systems has extremely high confidence in it. In the experiments below, ROVER was run with the average confidence method.

4 Experiments

Two speech datasets were chosen for this experiment. The first was a subset of travel reservation conversations. This subset contained only sentences with more than five words in the output, since the MEMT system might not be reliable on short fragments. There were a total of 200 sentences in this dataset. The second dataset was a set of naval fleet management dialogues. These sentences were fairly long, averaging about ten words per sentence. There were 400 sentences in that set. ROVER and MEMT were run over all the original ASR outputs. For the MEMT system, a matching window of three, a lingering word horizon of five, and a small part-of-speech dictionary were used.

The output of the original ASR systems, ROVER, and MEMT were compared using the word error rate (WER), the word accuracy of the output, and the METEOR metric. The WER is calculated by dividing the number of insertions, deletions, and substitutions needed to "fix" the output by the number of words in the reference transcription; this number can be greater than 100%. The accuracy is the number of correct words divided by the number of words in the reference transcription. METEOR is an MT evaluation metric that is heavily biased toward recall (Lavie, Sagae et al. 2004).
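The three voting options described in Section 3 amount to simple arithmetic over an aligned word slot. The Python sketch below illustrates that arithmetic and reproduces the "Washington" example; it is a toy illustration with hypothetical names, not the NIST ROVER implementation, which operates on full word transition networks.

from collections import defaultdict

def vote(candidates, method="avg_confidence"):
    # candidates: one (word, confidence) pair per ASR system for a single
    # aligned slot.
    votes = defaultdict(list)
    for word, conf in candidates:
        votes[word].append(conf)
    if method == "frequency":        # word proposed by the most systems
        return max(votes, key=lambda w: len(votes[w]))
    if method == "avg_confidence":   # highest average confidence
        return max(votes, key=lambda w: sum(votes[w]) / len(votes[w]))
    if method == "max_confidence":   # single highest-scoring arc
        return max(votes, key=lambda w: max(votes[w]))
    raise ValueError("unknown method: " + method)

# The "Washington" example from Section 3:
slot = [("Washington", 0.75), ("Washington", 0.75),
        ("Washington", 1.0), ("washing ton", 0.90)]
print(vote(slot, "frequency"))       # Washington: chosen by 3 of 4 systems
print(vote(slot, "avg_confidence"))  # washing ton: 0.90 > (0.75+0.75+1.0)/3
print(vote(slot, "max_confidence"))  # Washington: one arc scored 1.0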
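Since the tables below report WER, it may help to make the computation concrete. The following sketch implements the standard word-level edit-distance formulation just described; it is a generic illustration, not the scoring script used in these experiments.

def wer(reference, hypothesis):
    # Word error rate: minimum insertions, deletions, and substitutions
    # needed to turn the hypothesis into the reference, divided by the
    # reference length. Insertions let this exceed 100%.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j                  # insert every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the meeting is at ten", "the meeting at ten tomorrow"))  # 0.4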
After getting back the first set of results, MEMT was run again with more aggressive settings that intuitively seemed better tuned to speech data. Since there is not much reordering of words in speech output, the matching window and lingering word horizon were tightened to prevent the artificial alignment from reordering words. Also, when speech recognizers make transcription errors, the errors are not likely to produce words of a similar part of speech; such errors tend to produce words that sound the same, which may or may not share a part of speech. Therefore, the MEMT system was run again with a lingering word horizon of zero, a matching window of one, and no part-of-speech dictionary.

Finally, to see whether word reordering was the main cause of the lower MEMT scores, the output of MEMT was reordered so that its word order matched that of the reference transcription. Extraneous words and incorrect words were still left in the output. The hope was that the errors remaining after this "cleanup" pass would be the same kinds of errors that ROVER is apt to make.
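A plausible reconstruction of this cleanup pass is sketched below; the exact pass is not specified in the text, so the handling of duplicate and unmatched words here is an assumption. Words that also occur in the reference are rearranged into reference order, while extraneous and incorrect words are left in place.

def fix_ordering(reference, hypothesis):
    # Hypothetical reconstruction: reorder only the hypothesis words that
    # occur in the reference, keeping erroneous words where they were.
    ref, hyp = reference.split(), hypothesis.split()
    pool = list(ref)
    matched, leftovers = [], []
    for i, word in enumerate(hyp):
        if word in pool:             # correct word, possibly out of order
            pool.remove(word)
            matched.append(word)
        else:                        # extraneous or incorrect word: keep it
            leftovers.append((i, word))
    ordered = sorted(matched, key=ref.index)  # impose reference order
    for i, word in leftovers:        # splice errors back near their slots
        ordered.insert(min(i, len(ordered)), word)
    return " ".join(ordered)

print(fix_ordering("set course for the harbor",
                   "course set for the harbour"))
# -> "set course for the harbour": order fixed, wrong word kept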
5 Results

System                      WER     Word Accuracy   METEOR
mfcc                        10.3%   92.2%           0.9071
lda                          8.1%   93.5%           0.9184
pca                          9.1%   92.7%           0.9103
plp                         11.5%   91.1%           0.8987
ROVER                        8.4%   93.6%           0.9188
MEMT                        15.0%   92.7%           0.9147
MEMT v2                     11.0%   92.2%           -
MEMT with fixed ordering    10.2%   92.4%           -

Table 1: Scores on the naval fleet management data

In the naval fleet management dataset, all the original ASR systems do fairly well. The WERs are less than 15% and METEOR gives them all fairly high scores. The best original ASR system is lda; the worst system, plp, is not far behind. ROVER does not improve on the WER of the best original output, but it does slightly improve the accuracy and METEOR scores. The baseline MEMT system has a WER higher than any of the original ASR outputs. The accuracy of the MEMT output is in the middle of the pack, and the METEOR metric ranks the original MEMT system fairly high. The more aggressively tuned parameters reduce the WER of the MEMT output while decreasing the accuracy slightly. The WER is still fairly high, but the MEMT system no longer does worse than all of the original ASR systems. When the output of the aggressive MEMT system is reordered the "right" way, the WER is reduced even further, but the result is still worse than several of the original ASR outputs as well as ROVER.

System                      WER     Word Accuracy   METEOR
F00                         22.6%   81.1%           0.4601
F01                         22.3%   81.1%           0.4606
M00                         36.1%   68.2%           0.3869
M01                         37.0%   66.7%           0.3811
ROVER                       28.7%   83.3%           0.4637
MEMT                        36.6%   74.9%           0.4255
MEMT v2                     40.1%   74.0%           0.4160
MEMT with fixed ordering    34.3%   75.5%           0.4160

Table 2: Scores on the travel reservation data

In the travel reservation dataset, all the original ASR systems do poorly: the WERs are greater than 20%, and there is a large difference between the best system and the worst. The METEOR scores and the word accuracies agree on the rankings of the systems; the WER rankings differ. ROVER does not improve the WER of the ASR output. The MEMT system has a higher WER than ROVER. The more aggressively tuned parameters increase the WER of the system, which is counterintuitive. Reordering the output of the aggressively tuned MEMT system decreases the error rate compared to the original MEMT system, but still does not come close to the WER of the ROVER output.

6 Conclusion

ROVER does not improve on the output of the ASR systems used in these experiments. The original MEMT system performs worse on these datasets than ROVER. A more aggressively tuned MEMT system performs slightly better than the standard system, but is still worse than the ROVER output. One reason for this appears to be the reordering that the MEMT system allows, which does not occur in speech output. Undoing the reordering done by MEMT improves the WER of the MEMT output, but still does not bring MEMT to the level of ROVER.

7 Future Work

In order to reach conclusive results, a dataset on which ROVER works well is needed. On such a dataset, if the MEMT system still does worse than ROVER, we can identify where ROVER is doing something right and MEMT is doing something wrong. We can then perturb the input so that it more closely matches MT output and run both ROVER and MEMT on the perturbed data. By doing so, we can see where the performance of ROVER and MEMT degrades, which allows us to see which differences between speech and MT data, if any, are responsible for making combining MT output harder than combining speech output.

8 References

Bangalore, S., G. Bordel, and G. Riccardi (2001). Computing Consensus Translation from Multiple Machine Translation Systems. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU-2001), Italy.

Fiscus, J. G. (1997). A Post-processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER). In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU-1997).

Frederking, R. and S. Nirenburg (1994). Three Heads are Better than One. In Proceedings of the Fourth Conference on Applied Natural Language Processing (ANLP-94), Stuttgart, Germany.

Hogan, C. and R. E. Frederking (1998). An Evaluation of the Multi-engine MT Architecture. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-1998), pp. 113-123. Springer-Verlag, Berlin.

Jayaraman, S. and A. Lavie (2005). Multi-Engine Machine Translation Guided by Explicit Word Matching. To appear in Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT-2005), Budapest, Hungary, May 2005.

Lavie, A., K. Probst, E. Peterson, S. Vogel, L. Levin, A. Font-Llitjos and J. Carbonell (2004). A Trainable Transfer-based Machine Translation Approach for Languages with Limited Resources. In Proceedings of the Workshop of the European Association for Machine Translation (EAMT-2004), Valletta, Malta.

Lavie, A., K. Sagae and S. Jayaraman (2004). The Significance of Recall in Automatic Metrics for MT Evaluation. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004), Washington, DC, September.

Tidhar, D. and U. Küssner (2000). Learning to Select a Good Translation. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Saarbrücken, Germany.