An Introduction to Machine Translation

Andy Way, DCU
The Rise & Fall of Different MT Paradigms
Three main approaches to RBMT
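[Figure: The Vauquois Pyramid. ANALYSIS rises from the source text and GENERATION descends to the target text; direct translation runs along the base, TRANSFER across the middle, and the language-neutral interlingua sits at the apex.]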
System Design: Concerns
Multilingual vs. Bilingual
Multilingual:
Extreme: Eurotra, i.e. 72 language pairs
Modest: EN→DE,FR,ES, i.e. 3 language pairs
Intermediate: EN,FR,DE,ES,JP, but not all combinations
Bilingual:
Unidirectional vs. Bidirectional
ENFR or FREN
Reversible vs. Non-reversible
ENFR, same EN,FR components for Analysis & Generation, and
reversible transfer module
ENFR & FREN, but different EN, FR components for Analysis &
Generation, and different transfer modules, NB, lack of modularity …
Direct vs. Transfer vs. Interlingua
Batch vs. Interactive
Advantages/Disadvantages of Direct Systems
Advantages
Engine's competence lies in its comparative grammar.
Highly robust: does not break down or stop when
it encounters unknown words, unknown grammatical
constructs, or ill-formed input.
Disadvantages
Designed for unidirectional translation between one pair of
languages; not conducive to genuine multilingual MT design.
‘word-for-word’ translation + local reordering = poor
translation, using cheap bilingual dictionary & rudimentary
knowledge of target language.
Linguistically and computationally naive: no analysis of the internal
structure of the input, especially w.r.t. the grammatical relationships
between the main parts of sentences.
Advantages/Disadvantages of Interlingual Systems
Advantages
Intermediate representation (IR) fully specified, i.e. no need to
‘look back’ at Source in order to generate Target.
Easy to extend to other languages.
Built-in back translation: useful for testing.
Disadvantages
How to define an Interlingua for closely related languages?
Truly universal Interlingua possible?
Advantages/Disadvantages of Transfer Systems
Advantages
No language-independent representations: the source IR is specific to
a particular language, as is the target-language IR.
So complexity of Analysis & Generation components much
reduced …
Also, no necessary equivalence between source and target
IRs for the same language!
Disadvantages
Not so easy to extend to other languages: n analysis modules, n
generation modules, n × (n − 1) transfer modules, i.e. not much less
than n² …
No guaranteed built-in back translation.
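The module arithmetic above can be checked with a short sketch. The function name module_counts is hypothetical, and the interlingua row assumes the usual textbook count of one analysis and one generation module per language with no transfer modules. With n = 9 the pair count reproduces the 72 language pairs quoted earlier for Eurotra.

```python
def module_counts(n: int) -> dict:
    """Count the modules needed to cover every ordered pair among n languages."""
    pairs = n * (n - 1)  # ordered (source, target) language pairs
    return {
        "language_pairs": pairs,
        # Transfer: n analysis + n generation + one transfer module per pair.
        "transfer": n + n + pairs,
        # Interlingua (assumed): n analysis + n generation, no transfer modules.
        "interlingua": n + n,
    }


for n in (2, 4, 9):
    print(n, module_counts(n))
```

Adding a tenth language costs 2 new modules in the interlingua design but 2 + 18 in the transfer design, which is the extensibility point made on the slide.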
Direct, or Indirect?
Direct:
From manufacturer's viewpoint, better, as it's more robust …
Indirect:
Falls over more easily.
Development phase can be trying.
Commercially, must be supplemented with techniques for dealing with
unseen Input.
What about Translation Quality?
Indirect systems clearly better in principle.
However, constructing MT engine requires considerable effort.
Direct Systems can achieve good performance.
Summary
Research: mostly Transfer-based, with rules automatically acquired
from data
Industrially: we can expect highly-developed Direct Systems to survive
for some years to come …
Other Material
Arnold, D. et al. (1994): Machine Translation - An Introductory
Guide; NCC Blackwell, Oxford
Hutchins, J. & H. Somers (1992): An Introduction to MT; Academic
Press, London
Trujillo, A. (1999): Translation Engines; Springer, London
Newer books include:
Bowker, L. (2002): Computer-Aided Translation Technology, U. of
Ottawa Press.
Somers, H. (2003): Computers and Translation: A translator's guide,
John Benjamins.
Bond, F. (2005): Translating the Untranslatable, CSLI.
Quah, C. (2006): Translation and Technology, Palgrave MacMillan.
Why Corpus-Based MT?
the (relative) failure of rule-based approaches
the increasing availability of machine-readable
text
the increase in capability of hardware (CPU,
memory, disk space) with associated decrease
in cost
Corpus-Based MT is here to stay
These approaches are now mainstream:
Most researchers are developing corpus-based systems;
First company to use SMT now exists:
http://www.languageweaver.com;
CNGL partner Traslán uses EBMT/SMT hybrid;
In recent large-scale evaluations, corpus-based MT systems
come first.
Two caveats:
Most industrial systems are still rule-based (but cf. Google’s
systems now all SMT);
Current mainstream evaluation metrics favour n-gram-based
systems (i.e. bias towards SMT).
Thanks to Kevin Knight …
Centauri/Arcturan Exercise
Slides already on CA446 webpage …
Centauri/Arcturan [Knight, 1997]
Your assignment, put these words in order:
{ jjat, arrat, mat, bat, oloat, at-yurp }
There are 6! different orders possible, so 720 different
translations.
Best order (according to placement in the TL side of the
corpus) is as given above.
Not just unigrams, but n-grams also …
It’s Really Spanish—English!
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
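The word-ordering step can be sketched on the Spanish side of these twelve pairs. This is a minimal illustration, not the procedure from the exercise itself: it counts adjacent word pairs (bigrams) in the target-language sentences and scores each of the 6! = 720 orderings of the six words in "Clientes no venden medicinas en Europa" by how many attested bigrams it contains. The names TARGET_SIDE, bigram_counts and score are hypothetical, text is lowercased, and 7b is taken to read "clientes".

```python
from collections import Counter
from itertools import permutations

# Target-language (Spanish) side of the twelve sentence pairs, lowercased.
# Assumption: sentence 7b is taken to read "clientes".
TARGET_SIDE = [
    "garcia y asociados .",
    "carlos garcia tiene tres asociados .",
    "sus asociados no son fuertes .",
    "garcia tambien tiene una empresa .",
    "sus clientes estan enfadados .",
    "los asociados tambien estan enfadados .",
    "los clientes y los asociados son enemigos .",
    "la empresa tiene tres grupos .",
    "sus grupos estan en europa .",
    "los grupos modernos venden medicinas fuertes .",
    "los grupos no venden zanzanina .",
    "los grupos pequenos no son modernos .",
]


def bigram_counts(sentences):
    """Count every adjacent word pair in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def score(order, counts):
    """Sum the corpus counts of the adjacent word pairs in one ordering."""
    return sum(counts[pair] for pair in zip(order, order[1:]))


BIGRAMS = bigram_counts(TARGET_SIDE)
bag = ["clientes", "no", "venden", "medicinas", "en", "europa"]
orders = list(permutations(bag))
best = max(score(o, BIGRAMS) for o in orders)
tied = [o for o in orders if score(o, BIGRAMS) == best]

print(len(orders), "orderings; best score", best, "shared by", len(tied))
for o in tied:
    print(" ".join(o))
```

On this corpus only three bigrams among the six words are attested ("no venden", "venden medicinas", "en europa"), so the expected order reaches the top score but ties with five other orderings. With so little data the counts narrow the choice without settling it.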
Some more to try …
iat lat pippat eneat hilat oloat at-yurp.
totat nnat forat arrat mat bat.
wat dat quat cat uskrat at-drubel.
… if you have trouble sleeping at nights!
What have we just seen?
what parallel corpora look like;
how relevant parallel corpora are for MT;
how to build bilingual dictionaries from parallel corpora;
how cognate information may be useful in MT;
how to do word alignment.
What else do we need to know?
about word alignment on a larger scale;
about phrasal alignment, the norm in real translation
data;
about unknown words;
the importance of knowing the target language (vs.
source) in making fluent translations;
about locality in word order shifts;
how to guess the meanings/translations of unknown
words;
about how much uncertainty the machine faces in
working with limited data;
about working on different domains;
…
Do such methods scale to ‘real’ MT?
Availability of monolingual and bilingual corpora?
Possibility of sentence-aligning bilingual corpora?
Can we write an algorithm to extract the translation
dictionary?
Can we write an algorithm to extract the monolingual
word pair counts?
Can we write an algorithm to generate translations using
our translation dictionary and word pair counts?
WILL THE TRANSLATIONS PRODUCED BE ANY
GOOD?
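As one possible answer to the dictionary question, here is a minimal sketch that scores word pairs by the Dice coefficient over sentence-level co-occurrence in the twelve Spanish—English pairs seen earlier. Dice is a simple stand-in chosen for illustration, not necessarily the method used in real SMT systems. The names PAIRS, dice_table and best_translations are hypothetical, text is lowercased, and 7b is taken to read "clientes".

```python
from collections import Counter
from itertools import product

# (English, Spanish) sentence pairs from the exercise, lowercased.
# Assumption: sentence 7b is taken to read "clientes".
PAIRS = [
    ("garcia and associates .", "garcia y asociados ."),
    ("carlos garcia has three associates .", "carlos garcia tiene tres asociados ."),
    ("his associates are not strong .", "sus asociados no son fuertes ."),
    ("garcia has a company also .", "garcia tambien tiene una empresa ."),
    ("its clients are angry .", "sus clientes estan enfadados ."),
    ("the associates are also angry .", "los asociados tambien estan enfadados ."),
    ("the clients and the associates are enemies .",
     "los clientes y los asociados son enemigos ."),
    ("the company has three groups .", "la empresa tiene tres grupos ."),
    ("its groups are in europe .", "sus grupos estan en europa ."),
    ("the modern groups sell strong pharmaceuticals .",
     "los grupos modernos venden medicinas fuertes ."),
    ("the groups do not sell zenzanine .", "los grupos no venden zanzanina ."),
    ("the small groups are not modern .", "los grupos pequenos no son modernos ."),
]


def dice_table(pairs):
    """Dice score 2*c(e,f) / (c(e) + c(f)), counting sentences, not tokens."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (e, f): 2 * c / (src_freq[e] + tgt_freq[f])
        for (e, f), c in joint.items()
    }


def best_translations(word, table):
    """Return every target word sharing the top Dice score for `word`."""
    candidates = {f: d for (e, f), d in table.items() if e == word}
    top = max(candidates.values())
    return sorted(f for f, d in candidates.items() if d == top)


TABLE = dice_table(PAIRS)
for word in ("three", "has", "groups", "europe"):
    print(word, "->", best_translations(word, TABLE))
```

Words seen in several sentences resolve cleanly ("three" to "tres", "has" to "tiene"), while "europe", seen once, ties between "en" and "europa". That tie is the limited-data uncertainty listed above.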
Parallel Corpora
Hugely important … but not available in a wide
range of language pairs:
Chinese—English: Hong Kong data
French—English: Canadian Hansards
Older EU pairs: Europarl [Koehn 04]
Newer EU pairs: JRC-Acquis Communautaire, very
recently distributed updated Europarl
Arabic—English: LDC Data
NIST, IWSLT, TC-STAR Evaluations
…
Caveat interpres!
Beware of sparse data!
Beware of unrepresentative corpora!
Beware of poor quality language!
If the corpora are small, or of poor quality, or are
unrepresentative, then our statistical language
models will be poor, so any results we achieve
will be poor.
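The sparse-data warning can be made concrete with a toy bigram language model. This is a sketch under stated assumptions: the two training sentences are taken from the Spanish side of the earlier exercise, the function names are hypothetical, and add-one smoothing is used only as the simplest illustration of reserving probability for unseen events, not as a recommended method.

```python
from collections import Counter

# Deliberately tiny training corpus: two sentences from the earlier exercise.
CORPUS = [
    "los grupos no venden zanzanina .",
    "los grupos modernos venden medicinas fuertes .",
]

contexts = Counter()  # how often each word is followed by some other word
bigrams = Counter()   # how often each adjacent word pair occurs
vocab = set()
for sentence in CORPUS:
    tokens = sentence.split()
    vocab.update(tokens)
    contexts.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))


def p_mle(word, prev):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if never seen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


def p_add_one(word, prev):
    """Add-one (Laplace) estimate: every possible bigram gets one extra count."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


print(p_mle("medicinas", "venden"))      # seen once after "venden"
print(p_mle("fuertes", "venden"))        # plausible Spanish, but unseen
print(p_add_one("fuertes", "venden"))    # small but non-zero
```

Under the unsmoothed model any sentence containing "venden fuertes" receives probability zero, however reasonable it is, simply because the corpus is too small to have shown it.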