LING 138 Intro to Computer Speech and Language Processing

LIN6932
Topics in Computational Linguistics
Lecture 9: Machine Translation
Hana Filip
3/22/07
LIN 6932
1
Outline for MT Week
• Intro and a little history
• Language Similarities and Divergences
• Three classic MT Approaches
– Transfer
– Interlingua
– Direct
• Modern Statistical MT
• Evaluation
What is MT?
• Translating a text from one language to another
automatically.
Fully automatic translation lies at one end of the scale
and the work of the human translator armed with pencil
and paper at the other.
Between them are a number of possibilities for
collaboration between man and computer which include
word processing, terminology databases, voice
recognition and translation memory systems.
Machine Translation
• dai yu zi zai chuang shang gan nian bao chai you ting jian
chuang wai zhu shao xiang ye zhe shang, yu sheng xi li,
qing han tou mu, bu jue you di xia lei lai.
• Dai-yu alone on bed top think-of-with-gratitude Bao-chai
again listen to window outside bamboo tip plantain leaf of
on-top rain sound sigh drop clear cold penetrate curtain
not feeling again fall down tears come
• As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain
on the bamboos and plantains outside her window. The
coldness penetrated the curtains of her bed. Almost
without noticing it she had begun to cry.
Machine Translation
• The Story of the Stone also called
The Dream of the Red Chamber (Cao Xueqin 1792)
• Issues:
– Word segmentation, word order
– Sentence segmentation: 4 English sentences to 1 Chinese
– Grammatical differences
• Chinese does not grammatically mark tense on the verb:
– additional words and tense marking used in English: as, turned to, had begun
– tou -> penetrated
• Zero anaphora: a gap, in a phrase or clause, that has an anaphoric function similar
to a pro-form
• No articles
– Stylistic and cultural differences
• Bamboo tip plantain leaf -> bamboos and plantains
• Mu ‘curtain’ -> curtains of her bed
• Rain sound sigh drop -> insistent rustle of the rain
Not just literature
Hansard: Canadian parliamentary proceedings
Hansard is the traditional name for the printed transcripts of parliamentary debates in the
Westminster system of government.
Canadian Hansard and MT
• The bilingual nature of the Canadian federal government
requires that two equivalent Canadian Hansards be maintained:
one in French and one in English.
• This makes it a natural parallel text, and it is often used to train
French-English machine translation programs.
• Besides being already translated and aligned, the Hansards are
large and new material is always being added, which
makes them an attractive corpus.
• Problem: translations are accurate in meaning, but they are not
always literally exact
What is MT not good for?
• Really hard stuff
– Literature
– Natural spoken speech (meetings, court reporting)
• Really important stuff
– Medical translation in hospitals
What is MT good for?
• Tasks for which a rough translation is fine
– Web pages, email
• Tasks for which MT can be post-edited
– MT as first pass
– Computer-aided human translation
• Tasks in sublanguage domains where high-quality
MT is possible
– FAHQT [faah-quit]
(Fully Automatic High-Quality Translation)
Sublanguage domain
• Weather forecasting
– “Cloudy with a chance of showers today and Thursday”
– “Low tonight 4”
• Can be modeled completely enough to use raw MT
output
• Word classes and semantic features like
MONTH, PLACE, DIRECTION, TIME POINT
Some MT History
• 1946 Warren Weaver (mathematician and science administrator,
director of the Division of Natural Sciences at the Rockefeller
Foundation, 1932-55) and Andrew D. Booth (British
crystallographer) first discuss the possibility of MT in New
York
• 1947-48 idea of dictionary-based (word-for-word) direct
translation
• 1949 Weaver’s “Translation” memorandum popularized the MT
idea
Some MT History
1949 Weaver’s “Translation” memorandum popularized the MT idea
• limitations of any simplistic word-for-word approach
• four proposals:
– Approach the problem of multiple meanings by the examination of immediate
context
– logical elements in language
– cryptographic methods were possibly applicable
– linguistic universals:
“Think, by analogy, of individuals living in a series of tall closed towers, all
erected over a common foundation. When they try to communicate with one
another, they shout back and forth, each from his own closed tower. It is difficult
to make the sound penetrate even the nearest towers, and communication
proceeds very poorly indeed. But, when an individual goes down his tower, he
finds himself in a great open basement, common to all the towers. Here he
establishes easy and useful communication with the persons who have also
descended from their towers.”
Some MT History
• 1952: the first MT (“mechanical translation”) conference held
at MIT, 18 MT researchers
• 1954 First public demo of computer translation at Georgetown
University: 49 Russian sentences are translated into English
using a 250-word vocabulary and 6 grammar rules.
• 1955-65 a number of labs take up MT
History of MT: Pessimism
• 1959/1960: Bar-Hillel “Report on the state of MT in US and GB”
– Argued FAHQT too hard (semantic ambiguity, etc)
– Should work on semi-automatic instead of automatic translations
– His argument
Little John was looking for his toy box. Finally, he found it. The box was
in the pen. John was very happy.
– Only human knowledge lets us know that ‘playpens’ are bigger than
boxes, but ‘writing pens’ are smaller
– His claim: in order for MT to succeed, we would have to encode all of
human knowledge
Bar-Hillel’s report 1959/1960
MT research – now a “multimillion dollar affair”,
as he pointed out – was, with few exceptions, set
on a mistaken and unattainable goal, namely, fully
automatic translation of a quality equal to that of a
good human translator. This he held to be utterly
unrealistic, and in his view resources were being
wasted which could more fruitfully be devoted
to the development of less ambitious and more
practical computer aids for translators.
History of MT: Pessimism
1966
The ALPAC (Automatic Language Processing Advisory Committee) report
– Headed by John R. Pierce of Bell Labs
– Main Conclusion: years of research produced no useful results,
all current MT work had to be post-edited
• Supply of human translators exceeds demand
• All the Soviet literature is already being translated
• Sponsored evaluations which showed that the intelligibility and informativeness
of MT output were worse than those of human translations
– Results:
• MT research suffered
– halt in federal funding for machine translation in the US
– Number of research labs declined
– Association for Machine Translation and Computational Linguistics
dropped MT from its name
History of MT
• 1968 Systran founded by Peter Toma; one of the oldest machine translation
companies, it has done extensive work for the United States Department of Defense and
the European Commission, and provides the technology for Yahoo!, AltaVista's
Babel Fish, and Google's online translation services, among others
• 1977 METEO system, developed at the Université de Montréal, installed
in Canada to translate weather forecasts from English to French
• 1970’s:
– European focus in MT; mainly ignored in US
• 1980’s
– ideas of using AI techniques in MT (Knowledge-based MT at Carnegie
Mellon University)
• 1990’s
– Commercial MT systems
– Statistical MT
– Speech-to-speech translation
Language Similarities and
Divergences
• Some aspects of human language are
universal or near-universal, others diverge
greatly.
• Typology: the study of systematic cross-linguistic similarities and differences
• What are the dimensions along which human
languages vary?
Morphological Variation
• Number of morphemes per word:
– Isolating languages
• Cantonese, Vietnamese: each word generally has one morpheme
– Polysynthetic languages
• Siberian Yupik (‘Eskimo’): a single word may have very many
morphemes
• Degree to which morphemes are segmentable:
– Agglutinative languages
• Turkish: morphemes have clean boundaries
– Vs. fusional languages
• Russian: a single affix conflates more than one grammatical category
(e.g., a case suffix fuses number, gender and case)
Syntactic Variation
• SVO (Subject-Verb-Object) languages
– English, German, French, Mandarin
• SOV Languages
– Japanese, Hindi
• VSO languages
– Irish, Classical Arabic
• SVO lgs generally have prepositions: to Yuriko
• SOV lgs generally have postpositions: Yuriko ni
Segmentation Variation
• Not every writing system marks word boundaries
with visual cues; those that do not include:
– Chinese, Japanese, Thai, Vietnamese
• Some languages tend to have sentences that are
quite long, closer to English paragraphs than
sentences:
– Modern Standard Arabic, Chinese
Lexical Divergences
• Word to phrases:
– English “computer science” = French “informatique”
• POS divergences
– English ‘she likes/VERB to sing’
German ‘Sie singt gerne’/ADV
– English ‘I’m hungry’/ADJ
Italian ‘Ho fame’/NOUN
Lexical Divergences: Specificity
• Grammatical constraints
– English marks gender on pronouns; Mandarin does not.
• So when translating a 3rd-person pronoun from Chinese to English, we need to figure out
the gender of the person!
• Similarly from English “they” to French “ils” (masc. plural) or “elles”
(feminine plural)
• Semantic constraints
– English ‘brother’
Mandarin ‘gege’ (older) versus ‘didi’ (younger)
– English ‘wall’
German ‘Wand’ (inside) vs. ‘Mauer’ (outside) cp. die Berliner Mauer
– German ‘Berg’
English ‘hill’ or ‘mountain’
Lexical Divergence:
many-to-many
Lexical Divergence: lexical gaps
• Japanese: no word for ‘privacy’
• English (and other languages): no single word for German
‘Schadenfreude’ (the enjoyment of another’s misfortune)
• English: no word for Japanese ‘oyakoko’ (something like
‘filial piety’)
• English ‘blue’ versus Russian ‘siniy’ (dark blue) and
‘goluboy’ (light blue)
Lexicalization Patterns
divergences
Leonard Talmy (1985)
“Lexicalization patterns: Semantic structure in lexical forms”
• English
The bottle floated out.
– Manner of motion lexicalized in the verb
– Direction of motion lexicalized in the ‘satellite’ (here V particle)
• Spanish
La botella salió flotando.
Lit: the bottle exited floating
– Manner of motion lexicalized in the gerund
– Direction of motion lexicalized in the verb
Lexicalization Patterns
divergences
• Verb-framed lg: mark direction of motion on verb
– Romance, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan,
Bantu families
• Satellite-framed lg: mark direction of motion on satellite
– Crawl out, float off, jump down, walk over to, run after
– Rest of Indo-European (e.g., Germanic, Slavic), Hungarian,
Finnish, Chinese
Structural divergences
• German: Wir treffen uns am Mittwoch
• English: We’ll meet on Wednesday
Thematic divergence
• German: Mir fällt der Termin ein
• English: I remember the date
MT on the web
• Babelfish:
– http://babelfish.altavista.com/
• Google:
– http://www.google.com/search?hl=en&lr=&client=safari&rls=en&q="1+taza+de+jugo"+%28zumo%29+de+naranja+5+cucharadas+de+azucar+morena&btnG=Search
3 methods for MT
• Direct
• Transfer
• Interlingua
Three MT Approaches: Direct,
Transfer, Interlingual
Direct Translation
• Proceed word-by-word through text
• Translating each word
• No intermediate structures except morphology
• Knowledge is in the form of
– Huge bilingual dictionary
– word-to-word translation information
• After word translation, can do simple reordering
– Adjective ordering English -> French/Spanish
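The word-by-word procedure above can be sketched in a few lines of Python. This is a toy illustration rather than a description of any real system: the four-entry English-Spanish dictionary, the one-word adjective list, and the function name `direct_translate` are all assumptions made up for the example.

```python
# Toy direct MT: word-for-word dictionary lookup, then one local reordering rule.
# The lexicon and the adjective list are hypothetical, for illustration only.
BILINGUAL_DICT = {
    "the": "la",
    "green": "verde",
    "witch": "bruja",
    "sleeps": "duerme",
}
ENGLISH_ADJECTIVES = {"green"}


def direct_translate(sentence):
    words = sentence.lower().split()

    # Step 1: translate each word; unknown words pass through unchanged.
    # We remember whether the source word was an adjective.
    translated = [(BILINGUAL_DICT.get(w, w), w in ENGLISH_ADJECTIVES) for w in words]

    # Step 2: simple reordering, English "Adjective Noun" -> Spanish "Noun Adjective".
    # "Noun" is crudely approximated as "the word right after an adjective".
    output = []
    i = 0
    while i < len(translated):
        word, is_adjective = translated[i]
        if is_adjective and i + 1 < len(translated):
            output.extend([translated[i + 1][0], word])
            i += 2
        else:
            output.append(word)
            i += 1
    return " ".join(output)


print(direct_translate("The green witch sleeps"))  # la bruja verde duerme
```

Because the only knowledge is the dictionary plus a local swap, anything that needs a wider view of the sentence (long-distance reordering, agreement, sequences of adjectives) is out of reach for this sketch.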
Direct MT
Problems with direct MT
• German: complex reordering of words and phrases is necessary
The Transfer Model
• Idea:
Starting from a structural analysis, use rules about
differences between languages to translate directly from one
surface structure to another: syntactic transformations (adjusting
word order) and lexical transfer (selecting equivalents).
• Steps:
– Analysis: Syntactically parse Source language
– Transfer: Rules to turn this parse into parse for Target
language
– Generation: Generate Target sentence from parse tree and
lexical transfer via lookup in the bilingual dictionary
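The three steps can be sketched as follows for a single English noun phrase translated into French. This is a minimal sketch under stated assumptions: the parse tree is written by hand (a real system would get it from a parser), and the tree encoding, the three-word lexicon, and the function names `transfer` and `generate` are hypothetical.

```python
# Toy transfer MT, English -> French.
# A tree is (label, child, child, ...) for phrases and (pos_tag, word) for leaves.
# The lexicon is invented; gender agreement is simply baked into its entries.
LEXICON = {"the": "la", "green": "verte", "house": "maison"}

# Analysis: assumed output of a source-language parser.
source_tree = ("NP", ("Det", "the"), ("Nominal", ("Adj", "green"), ("Noun", "house")))


def is_leaf(tree):
    return len(tree) == 2 and isinstance(tree[1], str)


def transfer(tree):
    """Syntactic transfer rule: Nominal -> Adj Noun  =>  Nominal -> Noun Adj."""
    if is_leaf(tree):
        return tree
    label, *children = tree
    children = [transfer(child) for child in children]
    if label == "Nominal" and [child[0] for child in children] == ["Adj", "Noun"]:
        children.reverse()
    return (label, *children)


def generate(tree):
    """Generation: read the leaves left to right, with lexical transfer by lookup."""
    if is_leaf(tree):
        return [LEXICON.get(tree[1], tree[1])]
    words = []
    for child in tree[1:]:
        words.extend(generate(child))
    return words


target_tree = transfer(source_tree)
print(" ".join(generate(target_tree)))  # la maison verte
```

Unlike the direct approach, the reordering rule here is stated over parse-tree categories rather than over adjacent words, so it applies wherever a `Nominal` of that shape occurs in the tree.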
English to French
• Generally
– English: Adjective Noun
– French: Noun Adjective
– Note: not always true
• Route mauvaise ‘bad road, badly-paved road’
• Mauvaise route ‘wrong road’
• But is a reasonable first approximation
– Rule: Adjective Noun -> Noun Adjective
Transfer rules
From English SVO to Japanese SOV
Lexical transfer
• Transfer-based systems also need lexical transfer rules
• Bilingual dictionary (like for direct MT)
• English home (lexical ambiguity)
• German
– nach Hause (going home)
– Heimat (homeland, home country)
– zu Hause (at home)
• Can list “at home <-> zu Hause”
• Or do Word Sense Disambiguation
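The two options for English "home" can be sketched as below. This is only an illustration: the one-entry phrase table, the list of motion verbs, the fallback to "Heimat", and the function name `translate_home` are assumptions, and looking at a single preceding word is a deliberately crude stand-in for real word sense disambiguation.

```python
# Hypothetical resources for choosing a German rendering of English "home".
PHRASE_TABLE = {("at", "home"): "zu Hause"}
MOTION_VERBS = {"go", "goes", "going", "went", "come", "comes", "coming", "came"}


def translate_home(tokens, i):
    """Pick a German rendering for tokens[i] == 'home' from its left neighbour."""
    previous = tokens[i - 1].lower() if i > 0 else ""

    # Option 1: a listed multiword entry such as "at home <-> zu Hause".
    # The returned phrase stands for the whole English pair "at home".
    if (previous, "home") in PHRASE_TABLE:
        return PHRASE_TABLE[(previous, "home")]

    # Option 2: a crude disambiguation cue, motion verb -> "nach Hause".
    if previous in MOTION_VERBS:
        return "nach Hause"

    # Otherwise fall back to the "homeland" sense.
    return "Heimat"


print(translate_home("She is at home".split(), 3))       # zu Hause
print(translate_home("He went home".split(), 2))         # nach Hause
print(translate_home("Ireland is her home".split(), 3))  # Heimat
```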
Systran:
combining direct and transfer
• Shallow syntactic parsing
– Morphological analysis, POS tagging
– Chunking of NPs, PPs, phrases
– Shallow dependency parsing (subjects, passives, head-modifiers)
• Transfer
– Translation of idioms
– Word sense disambiguation
– Assigning prepositions based on governing verbs
• Synthesis
– Apply rich bilingual dictionary
– Deal with reordering
– Morphological generation
Transfer: some problems
• A distinct set of transfer rules for each pair of
languages
• Grammar and lexicon full of language-idiosyncratic generalizations
• Hard to build, hard to maintain
Interlingua
• Intuition: Instead of lg-lg knowledge rules,
use the meaning of the sentence to help
• Steps:
– translate source sentence into meaning
representation
– generate target sentence from meaning
Interlingua for
Mary did not slap the green witch
Direct MT: pros and cons (Bonnie Dorr)
• Pros
– Fast
– Simple
– Cheap
– No translation rules hidden in lexicon
• Cons
– Unreliable
– Not powerful
– Rule proliferation
– Requires lots of context
– Major restructuring after lexical substitution
Interlingual MT: pros and cons (B. Dorr)
• Pros
– Avoids the proliferation of specific rules
– Easier to write rules
• Cons:
– Semantics is HARD
– Useful information lost (paraphrase)
What makes a good translation
• Translators often talk about two factors we want to
maximize:
• Faithfulness or fidelity
– How close is the meaning of the translation to the
meaning of the original
– (Even better: does the translation cause the reader to
draw the same inferences as the original would have)
• Fluency or naturalness
– How natural the translation is, just considering its
fluency in the target language
The impossibility of translation
Hebrew “adonai roi” (= The Lord is my Shepherd)
How do you translate it into a language whose culture has
no sheep or shepherds?
– Something fluent and understandable, but not faithful:
• “The Lord will look after me”
– Something faithful, but not fluent and natural
• “The Lord is for me like somebody who looks after animals
with cotton-like hair”
Statistical MT:
Faithfulness and Fluency formalized
• Best-translation of a source sentence S into
the target sentence T:
$\hat{T} = \operatorname*{argmax}_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$
– Idea: build probabilistic models of faithfulness
and fluency, and then combine these models to
choose the most probable (= best) translation
The IBM model
• those two factors might look familiar…
$\hat{T} = \operatorname*{argmax}_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$
• Yup, it’s Bayes rule:
$\hat{T} = \operatorname*{argmax}_{T} \; P(T)\, P(S \mid T)$
Noisy channel model for statistical
MT
Idea:
Statistical machine translation (MT) typically
takes as its basis a noisy channel model in which
the target language sentence, by tradition
labelled E, is seen as distorted by the channel
into the foreign language F.
Noisy channel model for MT
Background:
The Shannon-Weaver Model of Communication
Noisy channel model for MT
Idea:
Assume that the foreign (source language) input F we
must translate into English is a corrupted version of
some English (target language) sentence E, and that
our task is to discover the hidden (target language)
sentence E that generated our observation sentence F.
Hidden Markov Model
Noisy channel model for MT
Given a Spanish sentence to translate (the source-language sentence), we treat
it as the output of an English sentence (the target-language sentence) that has
gone through the noisy channel, and we search for the best possible
‘source’ English sentence. The channel itself is modeled by P(F|E), the
probability of the foreign sentence F given the English sentence E.
More formally
• Assume we are translating from a foreign language sentence F to
an English sentence E:
$F = f_1, f_2, f_3, \ldots, f_m$
• We want a decoder which is given F and produces the most
probable (= best) English sentence
$\hat{E} = e_1, e_2, e_3, \ldots, e_n$
$\hat{E} = \operatorname*{argmax}_{E} P(E \mid F)$ [1]
$= \operatorname*{argmax}_{E} \frac{P(F \mid E)\, P(E)}{P(F)}$ [2] (Bayes rule)
$= \operatorname*{argmax}_{E} P(F \mid E)\, P(E)$
where $P(F \mid E)$ is the Translation Model and $P(E)$ is the Language Model.
[1] The conditional probability of an English sentence E, given a foreign sentence F.
[2] We can ignore the denominator P(F) inside the argmax since we are choosing the best English
sentence for a fixed foreign sentence F, and hence P(F) is a constant.
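The final argmax can be made concrete with a toy decoder that scores a handful of candidate English sentences. This is a sketch under heavy assumptions: the three candidates and all probability values are invented for illustration, and a real decoder searches a very large space instead of enumerating a fixed list. The function name `decode` is hypothetical.

```python
import math

# Invented scores for translating the French sentence "ça me plait".
# In a real system P(F|E) comes from a trained translation model and
# P(E) from a trained language model.
CANDIDATES = {
    "that pleases me": {"p_f_given_e": 0.30, "p_e": 0.0001},
    "I like it":       {"p_f_given_e": 0.10, "p_e": 0.0020},
    "me pleases that": {"p_f_given_e": 0.30, "p_e": 0.000001},
}


def decode(candidates):
    """Return argmax_E P(F|E) * P(E); P(F) is constant and so is dropped."""
    def log_score(english):
        scores = candidates[english]
        # Summing logs avoids underflow when many small probabilities multiply.
        return math.log(scores["p_f_given_e"]) + math.log(scores["p_e"])

    return max(candidates, key=log_score)


print(decode(CANDIDATES))  # I like it
```

With these made-up numbers the literal rendering wins on the translation model alone, but the language model pulls the decision towards the more natural English sentence.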
More formally
$\operatorname*{argmax}_{E} P(E \mid F) = \operatorname*{argmax}_{E} P(F \mid E)\, P(E)$
• This equation leaves much unresolved concerning how the
actual translation is to be performed. Systems that
presuppose it are derived from the early IBM Models,
originally designed for speech recognition at IBM, and work
at the word level.
• Called the IBM model of MT
• The translation process involves translating words and then
rearranging them to recover the target language sentence.
Fluency: P(T)
• How to measure that this sentence
– That car was almost crash onto me
• is less fluent than this one:
– That car almost hit me.
• Answer: language models (N-grams!)
– For example P(hit|almost) > P(almost|was)
• But we can use any other more sophisticated
model of grammar
• Advantage: this is monolingual knowledge!
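A bigram language model of the kind mentioned above can be sketched as follows. The four-sentence training corpus is invented and far too small to be meaningful, add-one smoothing is just one simple choice among many, and the function names are hypothetical; the point is only that the model needs nothing but monolingual text.

```python
from collections import Counter

# A tiny made-up English corpus; real language models use millions of words.
CORPUS = [
    "that car almost hit me",
    "the car almost hit the wall",
    "that dog almost bit me",
    "the ball was almost flat",
]


def train_bigram_lm(sentences):
    """Collect history counts, bigram counts, and the vocabulary size."""
    histories, bigrams = Counter(), Counter()
    vocabulary = {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens)
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams, len(vocabulary)


def sentence_prob(sentence, model):
    """P(sentence) under an add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(previous, word)] + 1) / (histories[previous] + vocab_size)
    return prob


lm = train_bigram_lm(CORPUS)
fluent = sentence_prob("that car almost hit me", lm)
disfluent = sentence_prob("that car was almost crash onto me", lm)
print(fluent > disfluent)  # True
```

The disfluent sentence is penalised because several of its bigrams (for example "almost crash") never occur in the training text and receive only the smoothing mass.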
Faithfulness: P(S|T)
probability that each word in target sentence would generate
each word in source sentence.
• French: ça me plait [that me pleases]
• English:
– that pleases me (most faithful)
– I like it (most fluent)
• How to quantify faithfulness?
• Intuition: degree to which words in one sentence are
plausible translations of words in other sentence
Faithfulness P(S|T)
• Need to know, for every target language word,
probability of it mapping to every source language
word.
• How do we learn these probabilities?
• Parallel texts!
– two texts that are translations of each other
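One standard way to learn such word translation probabilities from parallel text is expectation-maximization in the style of IBM Model 1. The sketch below is a simplified version under stated assumptions: there is no NULL word, the three sentence pairs are invented, sentences are assumed to be already aligned, and the function name `train_model1` is hypothetical.

```python
from collections import defaultdict

# Toy sentence-aligned corpus of (source, target) pairs; invented for illustration.
PARALLEL = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("maison bleue", "blue house"),
]


def train_model1(pairs, iterations=10):
    """Estimate t(s | t), the probability that target word t generates source word s."""
    pairs = [(source.split(), target.split()) for source, target in pairs]
    source_vocab = {word for source, _ in pairs for word in source}

    # Start uniform over all word pairs that co-occur in some sentence pair.
    uniform = 1.0 / len(source_vocab)
    t_prob = {(s, t): uniform for source, target in pairs for s in source for t in target}

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times t generated s
        total = defaultdict(float)  # expected number of words generated by t
        for source, target in pairs:
            for s in source:
                # E-step: split one unit of count for s among the target words,
                # in proportion to the current probabilities.
                norm = sum(t_prob[(s, t)] for t in target)
                for t in target:
                    fraction = t_prob[(s, t)] / norm
                    count[(s, t)] += fraction
                    total[t] += fraction
        # M-step: renormalise expected counts into probabilities.
        t_prob = {(s, t): c / total[t] for (s, t), c in count.items()}
    return t_prob


t_prob = train_model1(PARALLEL)
for pair in [("maison", "house"), ("la", "house"), ("bleue", "house")]:
    print(pair, round(t_prob[pair], 3))
```

Nobody tells the model that "maison" goes with "house"; the pairing emerges because "la" is better explained by "the" and "bleue" by "blue" in the other sentence pairs.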
Word Alignment
– All statistical translation models are based on
the idea of a word alignment
– French - English word alignment
Word Alignment
• The IBM models require that each French word comes from exactly one
English word: one-to-one and one-to-many alignments sanctioned
• Many-to-many and many-to-one alignments disallowed by basic MT models
• We can represent the above alignment by giving the index number of the
English word that the French word comes from: A = 2,3,4,5,6,6,6.
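The index notation can be unpacked with a short script. The alignment figure is not reproduced in this text, so the sentence pair below is an assumption chosen to fit the vector A = 2,3,4,5,6,6,6 (six English words, seven French words); only the vector itself comes from the slide.

```python
# Assumed example sentences; the alignment vector is the one given above.
english = "And the program has been implemented".split()
french = "Le programme a été mis en application".split()
alignment = [2, 3, 4, 5, 6, 6, 6]  # 1-based English index for each French word

assert len(alignment) == len(french)

# Each French word comes from exactly one English word...
links = [(f, english[a - 1]) for f, a in zip(french, alignment)]

# ...but one English word may generate several French words (one-to-many),
# and some English words may generate none.
generated_by = {}
for french_word, english_word in links:
    generated_by.setdefault(english_word, []).append(french_word)

for english_word in english:
    print(english_word, "->", generated_by.get(english_word, []))
```

Storing one English index per French word makes it impossible to express a French word that comes from two English words, which is exactly the many-to-one restriction stated above.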
Word Alignment
TRAINING ALIGNMENT MODELS
• All statistical translation models are trained using
a large parallel corpus.
• A parallel corpus, parallel text, or bitext is a text
that is available in two languages.
• For example, the proceedings of the Canadian
parliament are kept in both French and English.
Each sentence spoken in parliament is translated,
producing a volume with running text in both
languages.
Word Alignment
• First step: Sentence alignment
– Figuring out which source language sentence maps to
which target language sentence
• Second step: Word alignment
– Figuring out which source language word maps to
which target language word for each sentence pair (F,
E).
Back to Faithfulness and Fluency
• The job of the faithfulness model P(S|T) is to model a “bag of
words”, e.g., which words align between English and Spanish when
translating from Spanish to English.
• P(S|T) does not have to worry about language-particular facts about
target-language word order: that’s the job of P(T) (the language model)
• P(T) can do bag generation: rearrange the words so that they
recover the correct word order of the target sentence (from
Kevin Knight, USC/Information Sciences Institute)
P(T) and bag generation:
problem
• Problem: the ‘bag of words’ statistical MT
does not model relations among words
• How about:
– loves Mary John
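The problem can be demonstrated by actually running bag generation on this bag. The sketch below assumes a four-sentence invented corpus and a deliberately simple fluency score (a product of bigram counts plus one); the function name `lm_score` is hypothetical.

```python
from collections import Counter
from itertools import permutations

# Invented corpus: both orders are fluent English, so both are attested.
CORPUS = ["John loves Mary", "Mary loves John", "John sees Mary", "Mary sees John"]

bigrams = Counter()
for sentence in CORPUS:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    bigrams.update(zip(tokens, tokens[1:]))


def lm_score(words):
    """Unnormalised fluency score: product of (bigram count + 1)."""
    tokens = ["<s>", *words, "</s>"]
    score = 1
    for pair in zip(tokens, tokens[1:]):
        score *= bigrams[pair] + 1
    return score


bag = ["loves", "Mary", "John"]
ranked = sorted(permutations(bag), key=lm_score, reverse=True)
for order in ranked[:3]:
    print(" ".join(order), lm_score(order))
```

The language model correctly pushes the verb-initial and verb-final orders down, but "John loves Mary" and "Mary loves John" receive the same score: fluency alone cannot recover who loves whom once the relations among the words have been discarded.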
Phrase-Based MT
Recently there has been considerable interest in MT systems based
not upon words, but rather upon syntactic phrases.
Such MT systems perform the translation by assuming that during
the training phase the target language (but not the source language)
specifies not just the words, but rather the complete parse of the
sentence.
Eugene Charniak, Kevin Knight and Kenji Yamada (2003)
“Syntax-based Language Models for Statistical Machine
Translation”