Predicting Crowd-Based Translation Quality with Language-Independent Feature Vectors

Human Computation
AAAI Technical Report WS-12-08
Niklas Kilian, Markus Krause, Nina Runge, Jan Smeddinck
University of Bremen
nkilian@tzi.de, phateon@tzi.de, nr@tzi.de, smeddinck@tzi.de
Abstract
Research over the past years has shown that machine translation results can be greatly enhanced with the help of
mono- or bilingual human contributors, e.g. by asking humans to proofread or correct outputs of machine translation
systems. However, it remains difficult to determine the
quality of individual revisions. This paper proposes a method to predict the quality of individual contributions by
analyzing task-independent data. Examples of such data are
completion time, number of keystrokes, etc. An initial evaluation showed promising F-measure values larger than 0.8
for support vector machine and decision tree based classifications of a combined test set of Vietnamese and German
translations.
Keywords: human computation, crowdsourcing, machine translation, answer prediction, rater reliability analysis
Introduction
Over the past years, research has shown that it is possible
to enhance the quality of machine translation (MT) results
by applying human computation. Targeted paraphrasing
(Resnik et al. 2010) and iterative collaboration between
monolingual users (Hu and Bederson 2010) are just two
examples. Another common approach is to ask mono- or
bilingual speakers to proofread and correct MT results.
When acquiring those corrections (e.g. through crowd services like Amazon Mechanical Turk, https://www.mturk.com), a quality measurement is required in order to identify truthful contributions and to exclude low-quality contributors (Zaidan 2011). Common methods – e.g. creating gold-standard
questions to verify the quality of contributors – do not
work reliably for natural language tasks, because the number of possible translations is too large. Automated
quality measurement systems would require large amounts
of individual corrections, which could then be used as a set
of correct translations to test against. However, this would
be a time-consuming and oftentimes very costly process,
since multiple contributors are required to validate just one
“correct” translation.
In this paper, a method is proposed that classifies individual answers at submission time without the need to provide or gather information about the source language, and which requires only very limited information about the target language. The algorithm calculates a vector of task-independent variables for each answer, and uses a machine learning algorithm to classify them.
Language-Independent Quality Control
The proposed algorithm uses thirteen task-independent variables, for example: the number of keystrokes, the completion time, the number of added/removed/replaced words, and the Levenshtein distance between the MT output and the user correction. The vector also includes more complex calculations using metadata acquired through the Google Translate University API. Translation costs, word alignment information and alternative translations are included in these calculations. In addition, a sentence complexity measurement of the target sentence is used, which is the only directly language-dependent variable that is included
in the feature vector. The feature vector is calculated after analyzing and comparing the initial translation and the
corrected version. It is then used to train a machine learning algorithm. The following implementations of machine
learning algorithms were tested as part of the evaluation
process: a support vector machine (LibSVM), Naïve Bayes, a decision tree (C4.5 algorithm) and a simple feedforward neural network (Multi-Layer Perceptron).
Standard parameters were used for all but the support vector machine, which was set to use the RBF kernel with
parameters optimized for the given training set. As some
feature vector elements are time-based, or counters for
actions that have been performed by the contributor, normalization was not applied.
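A minimal sketch (in Python, hypothetical rather than the authors' implementation) of how a few of these task-independent features could be computed is given below; the word-overlap counts stand in for the added/removed-word features, and the word count stands in for the sentence complexity measurement.

# Illustrative sketch of a few of the task-independent features described above.
# The exact feature definitions are assumptions, not the authors' code.
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def feature_vector(mt_output: str, correction: str,
                   keystrokes: int, seconds: float) -> list:
    """Collect a handful of task-independent features for one submission."""
    mt_words, user_words = mt_output.split(), correction.split()
    added = sum(1 for w in user_words if w not in mt_words)
    removed = sum(1 for w in mt_words if w not in user_words)
    return [
        keystrokes,                          # number of keystrokes
        seconds,                             # completion time
        added,                               # words added by the contributor
        removed,                             # words removed by the contributor
        levenshtein(mt_output, correction),  # edit distance MT output vs. correction
        len(user_words),                     # crude proxy for sentence complexity
    ]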
For initial tests, two popular Wikipedia articles were
chosen – one German article dealing with the “Brandenburg Gate” and a Vietnamese article about the “City of Hanoi”. Native speakers of German and Vietnamese prepared
a test set for each language. For each set, the first 150 sentences were taken from the respective article. Headlines,
incomplete sentences and sentences that contained words
or entire phrases in a strong dialect were removed from the
test set. The remaining sentences were then ordered by
word count and the longest 20% were removed. The remaining sentences still had an average length of 15 words.
After the pre-processing, the two test sets were scaled
down to 100 sentences each by randomly removing sentences. The purpose of this final step was to ensure better
comparability between the two tests. Both the Vietnamese
and German sentences were then translated into English
using Google Translate. A correction for all of the resulting
translations was acquired through Amazon Mechanical
Turk (via CrowdFlower), which yielded realistic submissions of varying quality. Each Turker was allowed to work on up
to six corrections and all submissions were accepted without any quality measurements (e.g. gold standard questions, country exclusion) in place.
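The removal of headlines, incomplete sentences and dialect passages was done by hand, but the length-based pruning and down-sampling described above can be expressed compactly; the following snippet is an illustrative sketch, not the authors' tooling.

import random

def prune_sentences(sentences, target_size=100, seed=0):
    """Order sentences by word count, drop the longest 20%, then randomly
    down-sample to a fixed test-set size, as described above."""
    by_length = sorted(sentences, key=lambda s: len(s.split()))
    kept = by_length[: int(len(by_length) * 0.8)]  # remove the longest 20%
    random.seed(seed)
    return random.sample(kept, min(target_size, len(kept)))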
All 200 submissions were then manually assigned to one
of two classes by multilingual speakers of both source and
target language. The first class represents all corrections
that were done by a contributor who enhanced the translation in a reasonable way. One example of this class is the
sentence “The lost previous grid was renewed for a thorough restoration 1983/84”, which was corrected to “The
previously lost grid was renewed for a thorough restoration
in 1983/84.” The second class contains invalid submissions
and low-quality corrections. Whenever there was potential to correct the MT result, but no or only minor changes
were made, the submission was considered invalid. If a
contributor obviously lowered the quality, or if changes
were made in an insufficient way, the submission was
considered invalid as well. Typical examples of this class are corrections where several words were swapped randomly, or where parts of the sentence were deleted or added repeatedly at the end of the sentence. The ratio of acceptable to unacceptable units was 63/37 for the German test set and 59/41 for the Vietnamese test set.
In a final step, each of the four machine learning implementations was trained and evaluated with a) only the German test set, b) only the Vietnamese test set, and c) both test sets combined. 10-fold cross-validation was used for all evaluations. The F-measure was calculated for each of the four machine learning techniques and each language, as well as for the combined test set containing both languages. Across all three test cases, the resulting numbers show the potential of the proposed method (see Figure 1).
Figure 1 - Evaluation results show the potential of the proposed method. Based on the F-measure value, the decision tree (DT) performed best, followed by the support vector machine (SVM), the feedforward neural network (MLP) and Naïve Bayes (NB).
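A comparable evaluation can be set up with off-the-shelf tools; the sketch below uses scikit-learn stand-ins for the four classifiers (CART in place of C4.5, MLPClassifier in place of the original Multi-Layer Perceptron) and assumes X and y hold the extracted feature vectors and the manual class labels.

# Rough re-creation of the evaluation protocol with scikit-learn stand-ins;
# not the original setup, which used LibSVM, C4.5 and an MLP directly.
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "SVM (RBF kernel)": SVC(kernel="rbf"),      # parameters would be tuned per training set
    "Decision tree": DecisionTreeClassifier(),  # CART as a stand-in for C4.5
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000),
}

def evaluate(X, y):
    """Print the mean F-measure of each classifier under 10-fold cross-validation."""
    for name, clf in CLASSIFIERS.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
        print(f"{name}: F-measure = {scores.mean():.2f}")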
Discussion & Future Work
Results suggest that the pruned decision tree performed
best in all three test cases. It shows acceptable results
due to low and balanced false positive and false negative
rates. The support vector machine has a similar F-measure,
but showed a high number of false positives during the
evaluation process. It is therefore less suited to the task of sorting out low-quality submissions. Although there are
notable differences between the German and Vietnamese
test results, the outcomes of the combined test indicate the
language-independence of the proposed algorithm. Whether the differences between German and Vietnamese can be
attributed to the slightly different ratio between “good” and
“bad” units in the training sets requires further investigation. In addition, future work on the topic will need to
carefully consider the suitability of the individual features,
removing those that have no positive impact on the classification results. Considering this exploratory work with
few training samples, the results look very promising.
Further evaluations with larger and more varied sample sets (and with possibly fewer pre-processing steps in place) will shed light
on the scalability of this language-independent approach to
translation quality prediction.
Acknowledgments
This work was partially funded by the Klaus Tschira
Foundation through the graduate school “Advances in
Digital Media”.
References
Hu, C., & Bederson, B. (2010). Translation by iterative collaboration between monolingual users. In Proceedings of Graphics Interface 2010.
Resnik, P., Quinn, A., & Bederson, B. B. (2010). Improving translation via targeted paraphrasing. In Proceedings of EMNLP 2010, 127-137.
Zaidan, O. (2011). Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of ACL 2011.