PLIN019 – Machine Translation - Basic information Introduction to

PLIN019 – Machine Translation
Basic information
Introduction to Translation
Introduction to Machine Translation
Outline of Machine Translation History
Bibliography
- John Hutchins – Machine translation: past, present, future
- John Hutchins – An introduction to machine translation
- Philipp Koehn – Statistical Machine Translation
- Sergei Nirenburg et al. – Readings in Machine Translation
- Jiří Levý – Umění překladu (The Art of Translation)
- Jiří Levý – České theorie překladu (Czech Theories of Translation)
Translation I
Translation
Translation is a transfer of a text from a source language to a
target language.
Interpreting
Interpreting is oral translation of spoken language.
Translation is like a woman: either faithful or beautiful.
Translation II
- technical translation × literary translation
- exact reproduction × loose translational rephrasing
Maimonides, 12th century
The context is crucial for translation.
Werner Winter (1923–2010)
Each word is an element pulled out of a complex language
system, and its relations to other segments of the system differ
from language to language.
Each meaning (sense) is an element of a complex system of
segments into which a speaker divides reality.
In the Mojave language: a woman’s father ≠ a man’s father
Which properties of a source should be preserved? – J. Levý
Translation (Levý)
A translation
- must reproduce
  - the words of the original
  - the ideas of the original
- should be able to be read as the original
- is to be read as a translation
- should
  - reflect the style of the original
  - show the translator’s style
- should be read as a text falling into the period of
  - the original
  - the translator
- can add or skip something from the original
- should never add or skip something from the original
Translatology
- deals with translation of texts between languages and
  semiotic systems
- questions of accuracy (fidelity) and translatability
- translation between cultural areas and between periods
- descriptive branch (criticism, history) × applied branch (practice)
- formed in the 1960s–70s, with a linguistic orientation
- 1980s: close to the theory of literature
- 1990s: turned to the translator himself or herself
Translator
What a good translator should know (Levý):
- the source language
- the target language
- the factual content of the text: facts of the period, the field
  (domain, for technical translation)
Levý on artistic translation
Translation should give the impression of a work of art.
Machine translation and artistic translation – Levý
The goal of machine translation is to fragment a sentence into
the simplest comparable elements; the goal of artistic translation
is the opposite: transferring the highest units.
Types of translations according to Roman Jakobson
- interlingual – transfer between different languages
- intralingual – transfer within a language, e.g. to a different
  dialect, to a standard language etc.
- intersemiotic – transfer between different semiotic
  systems (sign language)
Questions
- Is accurate translation between languages possible at all?
- What is easier: to translate from or to your mother tongue?
- How do we know that w1 is a translation equivalent of w2?
- English wind types: airstream, breeze, crosswind, dust
  devil, easterly, gale, gust, headwind, jet stream, mistral,
  monsoon, prevailing wind, sandstorm, sea breeze, sirocco,
  southwester, tailwind, tornado, trade wind, turbulence,
  twister, typhoon, whirlwind, wind, windstorm, zephyr
- How should we translate Czech words like alkáč, večerníček,
  telka, čoklbuřt, knížečka, ČSSD . . . ?
- And what about: matka, macecha, mamka, máma,
  maminka, matička, máti, mama, mamča, mamina?
- Navajo Code movie – language as a cipher
Linguistic relativity
- language properties substantially affect our view of the
  world
- properties of different languages differ significantly
- → their speakers live in different, incompatible worlds
Ludwig Wittgenstein
The limits of my language mean the limits of my world.
Fritz Mauthner
If Aristotle had spoken Chinese or Dakota he would have
arrived at a totally different logic.
Linguistic relativity – dualism
- mould theories: language and thinking are the same, we
  think in our language
- cloak theories: language is on the surface; behind it is a
  complex maze of thoughts
Where does linguistic relativity belong?
On Intelligence – Jeff Hawkins (see TED)
Le Ton beau de Marot – Douglas Hofstadter (Jabberwocky,
Žvahlav, palindromes)
Sapir-Whorf hypothesis
- an important theory in psycholinguistics
- language determines thought
- 1930s, Edward Sapir, building on linguistic relativity
- comparison of concepts in American-Indian and European
  languages
- elaborated by Benjamin Lee Whorf
- later criticized: testing a falsifiable form of the hypothesis
  (colour concepts) showed the opposite to be true
Machine Translation – definition
A discipline of computational linguistics dealing with the design,
implementation and application of automatic systems
(software) for translating texts with minimal human intervention.
E.g. translating with the help of an electronic dictionary does
not count as machine translation.
Machine translation – object of study
We consider only technical / specialized texts:
- web pages
- technical manuals
- scientific documents, papers
- leaflets, catalogues
- legal texts
- in general: texts from narrow domains
Nuances on the various language levels of literary texts are
beyond the scope of current MT systems.
Machine translation – problems
In practice, MT output is always revised. We distinguish
pre-editing (of the input) and post-editing (of the output).
Revision is sometimes necessary even for human translation.
MT systems make different types of errors.
These mistakes are typical of human translators:
- wrong prepositions (I am in school)
- missing determiners (I saw man)
- wrong tense (Uviděl jsem – I was seeing), . . .
For computers, errors in meaning are typical:
Kiss me honey. → Polib mi med. (honey rendered as the food)
Lexical choice
Choosing the proper translation equivalent:
- homonymy – pila, baby, ženu; byte, ate
- polysemy – take, run, line; klíč, kohout, mít
- synonymy – kluk, chlapec, hoch; dívka, holka, děvče
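The Kiss me honey example can be reproduced with a minimal sketch of word-for-word lookup. The tiny lexicon below is a hypothetical illustration, not taken from any real MT system: each entry lists candidate Czech equivalents, and the naive translator always takes the first one, so the polysemous word honey comes out in its food sense.

```python
# Hypothetical toy English-Czech lexicon (illustrative entries only).
# Each word maps to a list of candidate equivalents; the first is the default.
LEXICON = {
    "kiss": ["polib"],
    "me": ["mi", "mě"],
    "honey": ["med", "miláčku"],  # food sense first, term of endearment second
}


def translate_word_for_word(sentence):
    """Replace every word by its first listed equivalent, ignoring context."""
    words = sentence.lower().strip(".!?").split()
    return " ".join(LEXICON.get(word, [word])[0] for word in words)


def ambiguous_words(sentence):
    """Return the words that have more than one candidate equivalent."""
    words = sentence.lower().strip(".!?").split()
    return [word for word in words if len(LEXICON.get(word, [])) > 1]


print(translate_word_for_word("Kiss me honey."))  # polib mi med
print(ambiguous_words("Kiss me honey."))          # ['me', 'honey']
```

Picking the right candidate for each ambiguous word requires context, which is exactly the lexical choice problem.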
Word order I
Word order II – free word order
Word order rule
The richer the morphology of a language, the freer its word order.
Czech: Katka snědla kousek koláče. (Katka ate a piece of cake.)
Hungarian word-order variants and their MT output:
- Kati megevett egy szelet tortát → Katie eating a piece of cake
- Egy szelet tortát Kati evett meg → Katie ate a piece of cake
- Kati egy szelet tortát evett meg → Katie ate a piece of cake
- Egy szelet tortát evett meg Kati → Katie ate a piece of cake
- Megevett egy szelet tortát Kati → Katie eating a piece of cake
- Megevett Kati egy szelet tortát → Katie ate a piece of cake
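A small sketch of why free word order enlarges the input an MT system has to cope with: three constituents can be arranged in 3! = 6 ways, the same number as the Hungarian variants listed above. The split of the Czech sentence into three constituents is an assumption made for illustration, and the code only enumerates orderings; it says nothing about which orders sound natural or how they differ in emphasis.

```python
from itertools import permutations

# Assumed constituents of the Czech example sentence from the slide.
CONSTITUENTS = ["Katka", "snědla", "kousek koláče"]


def word_orders(constituents):
    """Return every ordering of the constituents as a sentence string."""
    return [" ".join(order) + "." for order in permutations(constituents)]


orders = word_orders(CONSTITUENTS)
for sentence in orders:
    print(sentence)
print(len(orders))  # 6
```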
Direct methods for improving MT quality
- limit input to a:
  - sublanguage (indicative sentences)
  - domain (informatics)
  - document type (patents)
- text pre-processing (e.g. manual syntactic analysis)
Basic terms
- accuracy (precision)
- intelligibility
- fluency
- source language, SL, L1
- target language, TL, L2
- corpus, corpora
- ambiguity, polysemy
- ...
Classification based on approach
- rule-based, knowledge-based – RBMT, KBMT
  - transfer
  - with interlingua
- statistical machine translation – SMT
- hybrid machine translation – HMT, HyTran
Vauquois’s triangle
[Figure: Vauquois’s triangle. Levels from base to apex: direct
translation, syntactic transfer, semantic transfer, interlingua.
Left side: source language analysis; right side: target language
generation.]
Classification based on interaction with a user
- (human, manual translation)
- machine-aided human translation – MAHT
- human-aided machine translation – HAMT
- fully automated high-quality (machine) translation – FAHQT
HAMT and MAHT together: CAT – computer-aided translation.
Classification according to direction and arity
Arity:
- bilingual systems
- multilingual systems
Direction:
- unidirectional
- bidirectional
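To see why arity and direction matter for system design, here is a rough count of the components a multilingual system needs. It is only a sketch under two stated assumptions: a transfer-based system needs a separate transfer component for every ordered language pair, while an interlingua-based system needs one analysis and one generation component per language. The value 23 is the number of official EU languages mentioned later in these slides.

```python
def transfer_modules(n_languages):
    """Ordered language pairs: one transfer component per direction."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """One analysis and one generation component per language."""
    return 2 * n_languages


for n in (2, 3, 23):
    print(n, transfer_modules(n), interlingua_modules(n))
# 2 2 4
# 3 6 6
# 23 506 46
```

Under these assumptions the pairwise approach grows quadratically with the number of languages, while the interlingua approach grows linearly.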
Systems of Machine Translation
Apertium (RBMT, open-source), Babelfish (Yahoo), Caitra
(CAT system), ČESILKO (Czech-Slovak translation), EuroTra
(ambitious EC project), Google Translate, Logos
(OpenLogos, one of the oldest MT systems), METEO
(translation of weather forecasts, English, French), Moses
(open-source MT system), Pangloss (example-based MT),
Rosetta (contains a logical analysis of propositions), Systran
(one of the oldest MT systems), Trados (translation memory,
CAT system), Verbmobil (speech-to-speech translation
among German, English and Japanese), MateCat
(open-source online CAT system), . . .
Conferences, workshops, institutions
- ACL – Annual meetings of the Association for
  Computational Linguistics
- NIST – National Institute of Standards and Technology
- Translating and the Computer (London)
- RANLP – Recent Advances in Natural Language
  Processing
- Workshop on Machine Translation (WMT)
- The Conference of the Association for Machine Translation
  in the Americas
- LREC – Language Resources and Evaluation Conferences
- www.wikicfp.com
(Electronic) resources
- links on PLIN019 web page
- MT Archive
- www.statmt.org
- ACL Anthology
Institutions
- IAMT – International Association for Machine Translation:
  - EAMT – European Association for Machine Translation
  - AMTA – The Association for MT in the Americas
  - AAMT – The Asia-Pacific Association for MT
- META-NET – unites European MT departments
- British Computer Society Natural Language Translation Group
- ÚFAL, MFF UK (Institute of Formal and Applied Linguistics,
  Charles University)
- Obec překladatelů (Czech Literary Translators’ Guild)
- Jednota tlumočníků a překladatelů (Union of Interpreters
  and Translators)
- Ústav translatologie, FF UK (Institute of Translation Studies,
  Charles University)
Motivations for MT
- period of information boom
  - 1922 – regular BBC radio broadcast
  - 1923 – radio broadcast in Czechoslovakia
  - 1936 – regular BBC TV broadcast
  - 1953 – TV broadcast in Czechoslovakia
- computer development
  - generation zero – Z1–3, Colossus, ABC, Mark I, II
  - first generation – ENIAC (Electronic Numerical Integrator
    And Computer, 1945), MANIAC
In 1947 RAM could store 100 numbers and a + b took 1/8 s!
Early MT beliefs
- translation is a repetitive activity, so it was believed that
  computers could take it over
- computers had been successful in deciphering war codes:
  would they also be useful for MT?
Warren Weaver
When I look at an article in Russian, I say: This is really written
in English, but it has been coded in some strange symbols.
I will now proceed to decode.
First impulses
In 1949 Weaver sent a memorandum to 200 addressees in
which he outlined some problems of MT.
- polysemy (ambiguity) is a common phenomenon
- intersection of logic and language
- connections with cryptography
- universal properties of languages
Early interest in MT arose at several departments: first at the
University of London (Andrew D. Booth), and soon after at MIT,
University of Washington, University of California, Harvard, . . .
Topics and first exchanges of experience
- morphological and syntactic analysis
- meaning and knowledge representation
- creating and working with electronic dictionaries
- 1952 – first public conference at MIT
- 1954 – first showcase of a working MT system
Alan Turing Test
Using language as humans do is a sufficient operational test for
intelligence.
Georgetown experiment
The first working prototype of MT.
- IBM, New York
- first public demonstration of MT
- a computer applied to a non-numerical task
- over 60 sentences (probably carefully selected)
- a dictionary of 250 words
- from Russian to English
- the grammar for Russian contained 6 rules
The experiment provoked enthusiasm: MT was obviously
possible (despite being presented fraudulently). Many new
projects arose afterwards, mainly in the USA and the USSR.
Progress in the 1950s
- MT provoked development in these fields:
  - theoretical linguistics (Chomsky)
  - computational linguistics
  - artificial intelligence
- as coverage grew, the quality of MT decreased
- even the best systems (GAT, Georgetown, Ru→En)
  provided unsatisfactory results
- generating random love poems (1952)
Progress in the 1950s
- the first PhD thesis on MT defended (1954)
- the journal Mechanical Translation founded (1954)
- first international MT conference held in London (1956)
- Noam Chomsky: Syntactic Structures (1957)
- MT research in the USSR and Japan
- first book about MT (Introduction), Paris (1959)
1960s, disappointment with poor results
- despite rather poor results, optimism prevailed
- Yehoshua Bar-Hillel wrote a critique of the state of MT in 1959
- he claimed computers are not capable of lexical
  disambiguation
- fully automated high-quality translation (FAHQT) is
  unreachable
Yehoshua Bar-Hillel – an example for disambiguation
Little John was looking for his toy box. Finally, he found it. The
box was in the pen. John was very happy.
Funding for MT projects began to decrease.
Progress in the 1960s
- MT in the USSR focused on abstracts of English scientific papers
- Association for MT in the USA (1962)
- Peter Toma leaves Georgetown MT, develops AUTOTRAN,
  later Systran
ALPAC report, 1966
- Automatic Language Processing Advisory Committee
- an institution under the U.S. National Academy of Sciences
- it carried out analyses and evaluations of MT quality and
  usability
- recommended reducing expenditure on MT support
- negative impact on MT as a scientific field
- the problem lay in a strong underestimation of the complexity
  of natural language understanding
- MT development continued without interruption in Europe,
  the USSR and Japan
- it took MT in the USA another 15 years to regain its previous
  respect and status
TAUM, METEO
TAUM
- Traduction Automatique à l’Université de Montréal
- established at Université de Montréal in 1965
- prototypes of MT systems: TAUM-73, TAUM-METEO
- first MT systems incorporating analysis of SL and
  synthesis of TL
- EN → FR
- TAUM Aviation (cancelled)
METEO
- 1981–2001 used for weather forecast translation
- author John Chandioux, Canada
Systran
- one of the oldest MT companies (1968)
- very popular translation system
- basis for Yahoo Babelfish
- until 2007 used even by Google
- RBMT; since 2010 hybrid translation
- from 1976 the official MT system used by the EEC
Renaissance – 1970s
- first Soviet MT program: AMPAR (En→Ru)
- Systran installed at the EC (1978)
- Xerox uses Systran
- a project proposes using Esperanto as an interlingua
  (refused)
Renaissance – 1980s
- development of rule-based systems with interlingua
- Rosetta project started (1980, logical interlingua)
- first data-driven systems (example-based MT)
- boom of commercial MT systems
- EUROTRA project (EU funded) began
- IBM introduces 8-bit ASCII (1983)
- Trados – the first company to develop CAT, Stuttgart
- Unicode project (1987)
- World Wide Web proposal (1989)
Renaissance – 1990s
- research on statistical MT (IBM)
- SDL (CAT market leader) founded in the UK (1992)
- Verbmobil project (1992–99)
- rule-based systems still dominate the field
- AltaVista Babelfish (1997), 500,000 requests/day
- first commercial online MT service, iTranslator
Renaissance – 2000s
- statistical systems dominate the field
- quality of rule-based systems improved by statistical
  methods (hybrid systems)
- new translation pairs
- NIST launches first round of MT system benchmarking
  (2001)
- EuroMatrix – a large-scale EC-funded project (2006)
- Moses (open-source statistical MT engine, 2007)
Too optimistic prognosis
Machine Translation nowadays I
- unprecedented computational power, data structures
- these make it possible to work with billions of words instantly
- Google 1 PB sort (2008)
  - 10 trillion 100-byte records
  - 6 hours; 4,000 PCs; 48,000 discs
  - MapReduce technique
- Google Ngrams
- development of MT systems for everyone
- number of parallel corpora steadily increasing
- focus on under-resourced languages (LREC)
- MT quality is improving slowly but steadily
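Back-of-the-envelope arithmetic for the Google sort figures on this slide. The assumptions are mine: the 1 PB data set is taken to be 10 trillion records of 100 bytes each (a decimal petabyte), the run is rounded to exactly 6 hours, and the load is spread evenly. The result is only an average rate of sorted data, not a measurement of actual disk or network traffic.

```python
RECORDS = 10 * 10**12   # 10 trillion records (assumed)
RECORD_SIZE = 100       # bytes per record
SECONDS = 6 * 60 * 60   # 6 hours, rounded
MACHINES = 4_000
DISKS = 48_000

total_bytes = RECORDS * RECORD_SIZE
aggregate_rate = total_bytes / SECONDS    # bytes per second over the cluster
per_machine = aggregate_rate / MACHINES   # bytes per second per PC

print(total_bytes == 10**15)              # True: 1 PB in decimal units
print(round(aggregate_rate / 10**9, 1))   # 46.3 (GB/s)
print(round(per_machine / 10**6, 1))      # 11.6 (MB/s)
print(DISKS // MACHINES)                  # 12 discs per PC
```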
Machine Translation nowadays II
- SMT rules
- intense acquisition of parallel (and comparable) data
- development of MT systems driven by the outputs of
  evaluation metrics
- USA: interest mainly in English as TL
- EU: translation between the 23 official languages of the EU
  (EuroMatrix): English, Bulgarian, Czech, Danish, Estonian, Finnish,
  French, Irish, Italian, Lithuanian, Latvian, Hungarian, Maltese, German,
  Dutch, Polish, Portuguese, Romanian, Greek, Slovak, Slovene, Spanish
  and Swedish.
Machine translation nowadays III
- big companies (Microsoft) focus on English as SL
- large pairs (En↔Sp, En↔Fr): very good translation quality
- SMT enriched with syntax
- Google Translate as a gold standard
- morphologically rich languages neglected
- En→* and *→En pairs prevail
Motivation in the 21st century
- translation of web pages for gisting (grasping the main
  message)
- methods for substantially speeding up human translation
  (translation memories)
- cross-language extraction of facts and search for
  information
- instant translation of e-communication
- translation on mobile devices
Conclusion
- MT is an AI-complete problem
- immense computational power is at our disposal
- commercial (market) potential is bigger than ever
- there is always something to improve in MT
- statistical methods seem to be more convenient (fast,
  cheap)
- new ideas most welcome! (theses)