Evaluation of Machine Translation System

Automatic Evaluation of English-Indian Languages Machine Translation
Manisha†, Madhavi Sinha++, Rekha Govil†
Institute, Banasthali University, Banasthali(Raj.)-304022
smanisha@jp.banasthali.ac.in, grekha@banasthali.ac.in
BIT(Extension Center of BISR, Ranchi) Jaipur(Raj.)-302017
Abstract : Machine Translation (MT) refers to the use of a machine for performing the translation task, which converts text in one Natural Language (NL) into another Natural Language. Evaluation of MT is a difficult task because there may exist many perfect translations of a given source sentence. Human evaluation is the holy grail of MT evaluation, but due to lack of time and money it is becoming impractical. In past years many automatic MT evaluation techniques have been developed, most of them based on n-gram metrics. In this paper the authors discuss various problems and solutions for the automatic evaluation of English-Indian Languages MT, because these techniques cannot be applied as they are to English-Indian language MT systems, owing to the structural differences between the languages involved in the MT language pair.

The word 'Translation' refers to the transformation of one language into another. MT means automatic translation of text by computer from one natural language into another natural language. Work on Machine Translation started in the 1950s, after the Second World War. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The experiment was a great success and ushered in an era of machine translation research. Today many software packages are available for translating between natural languages[1,2].

Amongst the many challenges that Natural Language Processing (NLP) presents, the biggest is the inherent ambiguity of natural language. In addition, the linguistic diversity between the source and target language makes MT a bigger challenge. This is particularly true of languages widely divergent in their sentence structure, such as English and Indian languages. The major structural difference between English and Indian languages is that English follows the Subject-Verb-Object structure, whereas Indian languages have rich morphology, relatively free word order, and a default sentence structure of Subject-Object-Verb[3,4].

As is recognized the world over, with the current state of the art in MT it is not possible to have fully automatic, high quality, and general-purpose Machine Translation. The major cause is the need to handle ambiguity and the other complexities of NLP in practical systems.

Evaluation of an MT system is as important as the MT itself: it answers the questions about the accuracy, fluency and acceptability of the translation and thus ratifies the underlying MT algorithm. Evaluation has long been a tough task in the development of MT systems because there may exist more than one correct translation of a given sentence. The problem with natural language is that language is not exact in the way that mathematical models and theories in science are. Natural language has some degree of vagueness, which makes it hard in MT to put objective numbers to it for the evaluation.

Evaluation is needed to compare the performance of different MT engines or to improve the performance of a specific MT engine. Although there is general agreement about the basic features of evaluation of MT, there are no universally accepted and reliable methods and measures for it. MT evaluation typically includes features not present in the evaluation of other NLP systems. These are typically the quality of the raw (unedited) translation (intelligibility, accuracy, fidelity, appropriateness and style/register) and added features such as the usability for creating and updating dictionaries, for post-editing texts, for controlling input language, for customization of the documents, the extendibility to new language pairs and/or new subject domains, and cost-benefit comparisons with human translation performance.

Three types of evaluation are recognized for MT[5,6]:
- adequacy evaluation, to determine the fitness of MT systems within a specified operational context;
- diagnostic evaluation, to identify limitations, errors and deficiencies, which may be corrected or improved (by the research team or by the developers);
- performance evaluation, to assess stages of system development or different technical implementations.

Adequacy evaluation is typically performed by potential users and/or purchasers of systems (individuals, companies or agencies); diagnostic evaluation is the concern mainly of researchers and developers; and performance evaluation may be undertaken by either researchers/developers or by potential users.

Human evaluation is the holy grail for the evaluation of machine translation systems; however, it is time consuming, costly and subjective, i.e. evaluation results vary from person to person for the same sentence pair. The simplest evaluation question to a human expert can be "Is the translation good?" (Yes/No). This answer gives only the overall quality measure of the translation but not a detailed one. Given below is a suggestive list of criteria on which a human can evaluate the MT. The list has been drawn up for the English-Hindi language pair.

1. Gender/Number is properly translated or not?
2. Tense in the translated sentence is proper or not?
3. Voice of a sentence (i.e. active or passive) is properly translated or not?
4. Use of proper noun in the translation
5. Adjective and adverb corresponding to the nouns and verbs
6. The selection of words in the target language
7. The order of words
8. Use of punctuation symbols
9. Stress on the significant part
10. Maintaining the semantics of the source sentence
11. Overall quality of the translation, which may include localization issues, e.g. format of date, use of colors etc.

The other important issue for human evaluation is making the decision on the weights to be given to each of the above criteria to compute the final score. Basically these weights depend on the nature of the source and target languages, e.g. in the Sanskrit language criterion 7 can be given less weight than criterion 6.

The scale of evaluation cannot be binary (T/F). To better judge the quality of translation one can deploy a three point or five point scale with each of the above criteria, with labels such as Partially Acceptable and Not Acceptable.

Figure 1 shows a screenshot of a web application developed for human evaluation of MT (English-Indian Languages), displaying the evaluation screen for the human expert:

Figure 1 : Screenshot of a web application for human evaluation of MT.

A weighted average of all the criteria can be used to find the overall measure of human evaluation. One can also add other measures, e.g. extendibility, operational capabilities of the system and efficiency of use.

But deploying human evaluation for assessing an MT system is very expensive in terms of the time and cost involved in annotating the MT output. Moreover, for a more statistically significant result and the elimination of subjective evaluation, human evaluation of each MT output needs to be done by more than one evaluator, making the cost of human evaluation prohibitive. All these problems created interest in automatic evaluation methods. BLEU[7] and NIST[8] are based on the assumption 'The closer a machine translation is to a professional human translation, the better it is'. Both are based on the n-gram matching approach. Some other methods are F-Measure and Meteor. The basic approach of these methods is described below:

IBM's BLEU[7] - The BLEU metric is probably the best known automatic evaluation metric for Machine Translation. To check how close a candidate translation is to a reference translation, n-grams of the candidate translation are matched against those of the reference translations.

Main ideas of BLEU include:
- Exact matches of words
- Match against a set of reference translations for greater variety of expressions
- Account for adequacy by looking at word precision
- Account for fluency by calculating n-gram precision for n=1,2,3,4 etc.
- No recall
- Final score is weighted geometric average of the n-gram scores
- Calculate aggregate score over a large test set

Weakness: Sentences formed by switching words around, and thus having a completely different meaning, also get a high score.

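
The n-gram precision idea and the word-switching weakness can be illustrated with a small sketch. This is not the authors' tool: the function names `ngrams` and `modified_precision` are hypothetical, the English toy sentences are made up for illustration, and the clipping of counts to the maximum count seen in any one reference follows the modified precision described for BLEU[7].

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For every n-gram, remember the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# Illustrative data only: the candidate is the reference with its words switched around.
reference = "the cat sat on the mat".split()
scrambled = "mat the on sat cat the".split()

print(modified_precision(scrambled, [reference], 1))  # every word matches
print(modified_precision(scrambled, [reference], 2))  # no bigram matches
```

With word precision alone the scrambled sentence looks perfect, while the bigram precision drops to zero; this is why the higher-order n-grams are needed to account for fluency.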

Calculating BLEU

The final BLEU score is calculated as:

$$\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log P_n \right)$$

where BP is the brevity penalty, P_n is the modified n-gram precision, N is the maximum n-gram length and w_n are the weights of the geometric average (the original BLEU[7] uses uniform weights w_n = 1/N).

BLEU's Modified n-gram Precision for multiple sentences

The sentence is the basic unit of evaluation, i.e. only one sentence is evaluated at a time. For a block of text, first the n-gram scores are calculated for all candidates, for n up to a number N. Then the n-gram counts for all the candidates are added and the sum is divided by the number of candidate n-grams in the corpus. It is denoted by P_n:

$$P_n = \frac{\sum_i \left(\text{number of } n\text{-grams in segment } i \text{, in the translation being evaluated, with a matching reference co-occurrence in segment } i\right)}{\sum_i \left(\text{number of } n\text{-grams in segment } i \text{, in the translation being evaluated}\right)}$$

Brevity Penalty (BP)

Brevity Penalty was introduced to penalize candidates shorter than the references. Its main purpose is to prevent very short candidates from receiving too high a score. If a candidate receives a high score then it must match the reference in length, in word choice and in word order. The formula for calculating the Brevity Penalty is as follows:

$$BP = \begin{cases} 1 & \text{if } C > R \\ e^{(1 - R/C)} & \text{if } C \le R \end{cases}$$

Here, C is the total length of the translation corpus and R is the total length of the reference corpus. In case there are multiple references, R is the length of the reference which is closest to the length of the candidate.

There are some modifications of the original BLEU, as given below:

NIST[8] - Developed by the National Institute of Standards and Technology, the NIST scoring system (Doddington, 2002) is again based on n-gram precision, but it employs the arithmetic average of n-gram counts rather than a geometric average, and the n-grams in this case are weighted according to their information contribution, as opposed to just counting them as in BLEU. The score represents the average information per word, given by the n-grams in the translation that match an n-gram of a reference in the reference set. NIST's brevity penalty penalizes very short translations more heavily, and sentences close in length to the references less, than the BLEU brevity penalty. Being based on the BLEU metric, it has the same weak points.

F-Measure[5] (New York University) - uses 'maximum matching' from graph theory: a subset of co-occurrences in the candidate and reference text is counted in such a way that a token is never counted twice. On the maximum match size (MMS), Recall and Precision are defined, where

Recall(Candidate|Reference) = MMS(Candidate, Reference) / |Reference|
Precision(Candidate|Reference) = MMS(Candidate, Reference) / |Candidate|

A reward for longer matches is introduced; this reward is larger for longer matches, thus taking care of the fluency of the translation. The final measure is the harmonic mean of Precision & Recall.

Meteor[9] - uses one-gram overlaps and uses WordNet to account for the use of synonyms of words from the reference text. It has a separate module to address ordering, which explains why higher n-grams are not
used. A penalty for reordering is calculated based on how many chunks in the produced text need to be moved around to get the reference text.

Case Study : Evaluation of English to Hindi MT

By far BLEU is the most used evaluation strategy for MT. BLEU is mainly designed for the evaluation of translations between language pairs coming from the same family, like Spanish-English, French-English, German-English etc.[7], and it works well in such cases. But in English-Indian Languages there is often no one-to-one mapping between the source language text and the target language text[10]. So this method sometimes produces inadequate results. Here we present a few examples of BLEU applied to English-Hindi MT, highlighting the inadequacy of BLEU alone for MT evaluation of English-Indian Languages. The sentences have been drawn from the Tourism domain.

The notation used is:
ES : English Language Sentence (source)
C : Candidate Sentence (translated sentence)
R1, R2, R3 : Reference Sentences
BP : Brevity Penalty
M-BLEU : Modified BLEU Score
HES : Human Evaluation Score

Example 1.
ES: Daman and Diu offers you refreshing holiday.
C : दमन एवं द्वीप आपको ताज़गीभरी छुट्टियाँ देता है ।
R1 : दमन एवं द्वीप आपके अवकाश ताज़गीभरे बना देता है ।
R2 : दमन एवं द्वीप तुम्हारे अवकाश ताज़गीभरे बना देता है ।
R3 : दमन एवं द्वीप तुम्हारे अवकाश ताज़गीभरे बनाता है ।
BP = 1.0, BLEU = 0.3097, M-BLEU = 0.3578, HES = 1.000

Example 2.
ES: Daman and Diu offers you refreshing holiday.
C : ताज़गीभरी देता अवकाश एवं दमन है आपको ।
R1 : दमन एवं द्वीप आपके अवकाश ताज़गीभरे बना देता है ।
R2 : दमन एवं द्वीप तुम्हारे अवकाश ताज़गीभरे बना देता है ।
R3 : दमन एवं द्वीप तुम्हारे अवकाश ताज़गीभरे बनाता है ।
BP = 1.0, BLEU = 1.0000, M-BLEU = 1.0000, HES = 0.1020

Example 3.
ES: It was raining when we left for Goa.
C : जब हम गोआ के लिए निकले यह बारिश हो रही थी।
R1 : जब हम गोआ के लिए निकले बारिश हो रही थी।
BP = 1.0, BLEU = 0.4647, M-BLEU = 0.5358, HES = 0.6430

Example 4.
ES: There are some portraits of the rulers of Jodhpur also displayed at Jaswant Thada.
C : वहां जोधपुर के शासकों के कुछ चित्र भी जसवन्त ठाडा पर प्रदर्शित किये गये हैं ।
R1 : जोधपुर के शासकों के कुछ चित्र भी जसवन्त ठाडा पर प्रदर्शित किये गये हैं ।
R2 : जोधपुर के नरेशो की कुछ तसवीरे भी जसवन्त ठाडा पर प्रदर्शित की गयी हैं ।
R3 : जसवन्त ठाडा पर जोधपुर के नरेशो की कुछ तसवीरे भी प्रदर्शित की गयी हैं ।
BP = 1.0, BLEU = 0.4901, M-BLEU = 0.6692, HES = 1.000

correct word but the same word is not used in any of the references, then this dictionary will give the

Order checking is not done by BLEU; if shallow parsing is included with the automatic tool, then the tool can also evaluate the candidate on the basis of the order of the words.

BLEU is mainly designed for the evaluation of translations between language pairs where there is a one-to-one mapping, but in English-Hindi translation this may not always be the case. BLEU, when applied to English-Hindi MT, presents the following problems:

It considers synonyms as different words; thus if the candidate and reference sentences use synonyms, BLEU will score the translation low (Example 1: छुट्टियाँ, अवकाश). This is a weakness of BLEU itself, not specific to English-Hindi MT.
It does not take care of changes in the order of
