mt1

Machine Translation
Introduction to MT
Dan Jurafsky
Machine Translation
• Fully automatic
• Helping human translators
Enter Source Text:
这 不过 是 一 个 时间 的 问题 .
Translation from Stanford’s Phrasal:
This is only a matter of time.
Google Translate
• Fried ripe plantains:
• http://laylita.com/recetas/2008/02/28/platanos-maduros-fritos/
Machine Translation
• The Story of the Stone (“The Dream of the Red Chamber”)
• Cao Xueqin 1792
• Chinese gloss: Dai-yu alone at bed on think-of-with-gratitude
Bao-chai… again listen to window outside bamboo tip plantain
leaf of on, rain sound sigh drop, clear cold penetrate curtain, not
feeling again fall down tears come.
• Hawkes translation: As she lay there alone, Dai-yu’s thoughts
turned to Bao-chai… Then she listened to the insistent rustle of
the rain on the bamboos and plantains outside her window. The
coldness penetrated the curtains of her bed. Almost without
noticing it she had begun to cry.
Difficulties in Chinese to English
translation
• Long Chinese sentences: 4 English sentences to 1 Chinese
• Chinese has no articles (English the, a) and often omits pronouns
• Chinese has locative post-positions, English prepositions
• Chinese bed on, window outside, English on the bed, outside the window
• Chinese rarely marks tense:
• English marks tense: as, turned to, had begun
• Chinese tou, ‘penetrate’ -> English penetrated
• Chinese relative clauses are before the noun, English after
• Chinese: [window outside bamboo on] rain
• English: rain [on the bamboo outside the window]
• Stylistic and cultural differences
• Chinese bamboo tip plantain leaf -> bamboos and plantains
• Chinese rain sound sigh drop -> insistent rustle of the rain
• Chinese ma ‘curtain’ -> curtains of her bed
Alignment in Machine Translation
Early MT History
1946 Booth and Weaver discuss MT in New York
1947-48 idea of dictionary-based direct translation
1947 Warren Weaver suggests translation by computer
1949 Weaver memorandum
1952 all 18 MT researchers in world meet at MIT
1954 IBM/Georgetown Demo Russian-English MT
1955-65 lots of labs take up MT
http://www.hutchinsweb.me.uk/PPF-TOC.htm
1949 Weaver memorandum
• http://www.mt-archive.info/Weaver-1949.pdf
• “There are certain invariant properties which are…
common to all languages”
• “When I look at an article in Russian, I say ‘This is really
written in English, but it has been coded in some
strange symbols. I will now proceed to decode.’”
• “[If] one can see… N words on either side, then, if N is
large enough, one can unambiguously decide the
meaning of the central word.”
The History of MT: Pessimism
• 1959/1960
• Yehoshua Bar-Hillel “Report on the state of MT
in US and GB”
• FAHQ MT too hard because we would have to
encode all of human knowledge
• Instead we should work on computer tools for
human translators
The claim that fully automatic high
quality MT is impossible
Yehoshua Bar-Hillel. 1960. A Demonstration of the Nonfeasibility of
Fully Automatic High Quality Translation.
• Little John was looking for his toy
box. Finally he found it. The box was
in the pen. John was very happy.
• Pen1: Enclosure for small children
• Pen2: Writing utensil
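Weaver's window idea can be tried on Bar-Hillel's sentence. The following is a toy sketch only: the cue-word sets in `SENSE_CUES` are hand-picked for this example (a hypothetical stand-in for the world knowledge Bar-Hillel argued a program would need), and the sense labels paraphrase the two senses listed above.

```python
# Hypothetical, hand-picked cue words for the two senses of "pen".
# A real system would have to learn or encode this knowledge somehow.
SENSE_CUES = {
    "enclosure": {"toy", "box", "children", "play", "baby"},
    "writing utensil": {"ink", "write", "paper", "sign", "letter"},
}


def disambiguate(tokens, index, n):
    """Pick the sense whose cue words occur most often within n tokens of tokens[index].

    Returns None when the window contains no cue word at all.
    """
    lo = max(0, index - n)
    window = tokens[lo:index] + tokens[index + 1:index + 1 + n]
    scores = {
        sense: sum(tok in cues for tok in window)
        for sense, cues in SENSE_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


text = ("little john was looking for his toy box finally he found it "
        "the box was in the pen")
tokens = text.split()
target = tokens.index("pen")

print(disambiguate(tokens, target, 3))   # None: "was in the" carries no cue
print(disambiguate(tokens, target, 15))  # enclosure: the window now reaches "toy box"
```

A small window sees only function words and cannot decide. A larger one succeeds here only because the cue lists happen to contain "toy" and "box".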
• The box was in the pen.
The claim that fully automatic high
quality MT is impossible
Yehoshua Bar-Hillel, 1960
“I now claim that no existing or imaginable
program will enable an electronic computer
to determine…”
The state of the art in MT
History of MT: Further Pessimism
The ALPAC report
• Headed by John R. Pierce of Bell Labs
• Conclusions:
• MT doesn’t work
• MT a failure: all current MT work had to be post-edited
• Intelligibility and informativeness worse than human
• We don’t need MT anyhow
• Already too many human translators from Russian
• Results: MT research suffered
• Funding loss
• Number of research labs declined
• Association for Machine Translation and Computational
Linguistics dropped MT from its name
MT in the modern age
• 1975-1985 Resurgence of MT in Europe and Japan
• Domain-specific rule-based systems
• 1990-present
• Rise of Statistical Machine Translation
Machine Translation
Language Divergences
Language Similarities and Divergences
• Typology:
• the study of systematic cross-linguistic similarities
and differences
• What are the dimensions along which human
languages vary?
Syntactic Variation: Basic Word Orders
In many languages one word order is more basic
• SVO (Subject-Verb-Object) languages
English, German, French, Mandarin
I baked a pizza
• SOV Languages
Japanese, Hindi
English: He adores listening to music
Japanese: kare ha ongaku wo kiku no ga daisuki desu
Gloss: he music to listening adores
• VSO languages
• Irish, Classical Arabic, Tagalog
Morphology
• Morpheme: “Minimal meaningful unit of language”
Word = Morpheme + Morpheme + Morpheme +…
• Stems: (base form, root)
hope+ing → hoping, hop → hopping
• Affixes
• Prefixes: anti-, dis- in Antidisestablishmentarianism
• Suffixes: -ment, -arian, -ism in Antidisestablishmentarianism
• Infixes: hingi (borrow) – humingi (borrower) in Tagalog
• Circumfixes: sagen (say) – gesagt (said) in German
Morphemes per Word
Joseph Greenberg. 1954. A Quantitative Approach to the Morphological
Typology of Language. IJAL 26:3.
Scale from isolating (about 1 morpheme per word) to synthetic (about 4):
• Vietnamese: 1.06
• English: 1.68
• Yakut (Turkic): 2.17
• Swahili: 2.55
• West Greenlandic (Eskimo-Inuit): 3.72
Few morphemes per word: Cantonese
“He said this was the biggest building in the whole country”
Each word in this sentence has one morpheme (and one
syllable):
keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan
he say entire country most big bldg house is this bldg
Many Morphemes per word: Turkish
uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to
become civilized
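The morphemes-per-word measure can be computed directly from segmented text. This is a minimal sketch, assuming morphemes inside a word are joined by "+" as in the Turkish segmentation above; the function name `morphemes_per_word` is introduced here for illustration.

```python
def morphemes_per_word(segmented_words):
    """Average number of morphemes per word.

    Each word is a string whose morphemes are joined by '+';
    a word without '+' counts as a single morpheme.
    """
    counts = [len(word.split("+")) for word in segmented_words]
    return sum(counts) / len(counts)


# The Cantonese sentence from the slides: every word is one morpheme.
cantonese = "keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan".split()

# The Turkish word from the slides, already segmented.
turkish = ["uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına"]

print(morphemes_per_word(cantonese))  # 1.0
print(morphemes_per_word(turkish))    # 11.0
```

These two inputs are single examples, not corpus samples, so the values only illustrate the two ends of the scale.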
Word Segmentation
Are word boundaries marked in writing?
• Some writing systems: boundaries between words not
marked
• Chinese, Japanese, Thai
• Word segmentation becomes an important part of text
normalization for MT
• Some languages tend to have sentences that are quite
long, closer to English paragraphs than sentences:
• Modern Standard Arabic, Chinese
• Sentence segmentation may be necessary for MT between
these languages and languages like English
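One simple baseline for word segmentation is greedy longest-match ("max-match") against a word list. The sketch below is an illustration rather than what any particular MT system does. The tiny `LEXICON` is hypothetical and built from the Chinese example sentence shown earlier in these slides.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest lexicon match.

    Falls back to a single character when nothing in the lexicon matches.
    """
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon covering only the example sentence.
LEXICON = {"这", "不过", "是", "一", "个", "时间", "的", "问题"}

print(max_match("这不过是一个时间的问题", LEXICON))
# ['这', '不过', '是', '一', '个', '时间', '的', '问题']
```

Greedy matching can go wrong when a long lexicon entry spans a true word boundary, which is one reason segmentation is treated as its own normalization step.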
Inferential Load:
cold vs. hot languages
Balthasar Bickel. 2003. Referential density in discourse and syntactic typology. Language 79:2, 708-36
• Hot languages:
• Who did what to whom is marked explicitly
• English
• Cold languages:
• The hearer has more “figuring out” of who the various actors
in the various events are
• Japanese, Chinese
Inferential Load: The bracketed noun phrases (blue on the slide)
are not in the Chinese original
飓风丽塔已经减弱为第三级飓风,
Rita weakened and was downgraded to a Category 3 storm;
ø 迫近美国德克萨斯州和路易斯安那州,
[Rita/it/the storm] is moving close to Texas and Louisiana;
当局表示,
the authorities announced;
虽然 ø 在登陆前可能再稍微减弱,
although [Rita/it/the storm] might weaken again before landing,
但 ø 仍然会非常危险,
[Rita/it/the storm] is still very dangerous;
ø 预料 ø 会在当地时间星期六凌晨在德州和路易斯安那州之间登陆,
[the authorities] predict [Rita/it/the storm] will arrive at the Texas-Louisiana border on Saturday morning local time;
ø 直接吹袭休斯敦市东面的主要炼油设施。
[Rita/it/the storm] will directly hit the oil-refining industry east of
Houston.
Lexical Divergences
• Word to phrases:
• English computer science
• French informatique
• Part of Speech divergences
• English She likes to sing
• German Sie singt gerne [She sings likefully]
• English I’m hungry
• Spanish Tengo hambre [I have hunger]
Lexical Specificity Divergences
• Grammatical specificity
• Spanish: plural pronouns have gender (ellos/ellas)
• English: plural pronouns no gender (they)
• So translating “they” from English to Spanish, need to
figure out gender of the referent!
Lexical Divergences: Semantic
Specificity
English brother
Mandarin gege (older brother), didi (younger brother)
English wall
German Wand (inside), Mauer (outside)
English fish
Spanish pez (the creature), pescado (fish as food)
Cantonese ngau
English cow, beef
Predicate Argument divergences
L. Talmy. 1985. Lexicalization patterns: Semantic Structure in Lexical Form.
• English: The bottle floated out.
• Spanish: La botella salió flotando. [The bottle exited floating]
• Satellite-framed languages:
• direction of motion is marked on the satellite
• Crawl out, float off, jump down, walk over to,
run after
• Most of Indo-European, Hungarian, Finnish, Chinese
• Verb-framed languages:
• direction of motion is marked on the verb
• Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan,
Bantu families
Predicate Argument divergences:
Heads and Argument swapping
Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and
Proposed Solution," Computational Linguistics, 20:4, 597--633
Heads:
English: X swim across Y
Spanish: X cruzar Y nadando
English: I like to eat
German: Ich esse gern
English: I’d prefer vanilla
German: Mir wäre Vanille lieber
Arguments:
Spanish: Y me gusta
English: I like Y
German: Der Termin fällt mir ein
English: I remember the date
Predicate-Argument Divergence Counts
B.Dorr et al. 2002. DUSTer: A Method for Unraveling Cross-Language
Divergences for Statistical Word-Level Alignment
Found divergences in 32% of sentences in UN Spanish/English Corpus
• Part of Speech: X tener hambre / Y have hunger (98%)
• Phrase/Light verb: X dar puñaladas a Z / X stab Z (83%)
• Structural: X entrar en Y / X enter Y (35%)
• Heads swap: X cruzar Y nadando / X swim across Y (8%)
• Arguments swap: X gustar a Y / Y likes X (6%)
Machine Translation
Three classical methods for MT
3 Classical methods for MT
• Direct
• Transfer
• Interlingua
Three MT Approaches: Direct, Transfer,
Interlingual
Direct Translation
• Proceed word-by-word through text
• Translating each word
• No intermediate structures except morphology
• Knowledge is in the form of
• Huge bilingual dictionary
• word-to-word translation information
• After word translation, can do simple reordering
• Adjective ordering English -> French/Spanish
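The direct approach can be sketched in a few lines: look each word up in a bilingual dictionary, then apply a simple local reordering. The English-to-Spanish entries and part-of-speech tags in `LEXICON` below are a hypothetical toy dictionary written for this example, not the entries used on the slides.

```python
# Hypothetical toy English->Spanish dictionary: word -> (translation, part of speech).
LEXICON = {
    "the": ("la", "DET"),
    "green": ("verde", "ADJ"),
    "witch": ("bruja", "NOUN"),
}


def direct_translate(sentence):
    """Word-by-word translation followed by local ADJ NOUN -> NOUN ADJ reordering."""
    # Step 1: translate each word independently; unknown words pass through.
    translated = [LEXICON.get(word.lower(), (word, "UNK")) for word in sentence.split()]

    # Step 2: swap each adjective with the noun that immediately follows it.
    i = 0
    while i < len(translated) - 1:
        if translated[i][1] == "ADJ" and translated[i + 1][1] == "NOUN":
            translated[i], translated[i + 1] = translated[i + 1], translated[i]
            i += 2
        else:
            i += 1

    return " ".join(word for word, _ in translated)


print(direct_translate("the green witch"))  # la bruja verde
```

The sketch has no syntactic analysis at all, so it cannot handle reorderings that span more than adjacent words.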
Direct MT Dictionary entry
Direct MT
Problems with direct MT
• German
• Chinese
The Transfer Model
• Idea: apply contrastive knowledge, i.e., knowledge
about the difference between two languages
• Steps:
• Analysis: Syntactically parse source language
• Transfer: Rules to turn this parse into parse for
target language
• Generation: Generate target sentence from parse
tree
English to French
English: Adjective Noun
French: Noun Adjective
• This is not always true
Route mauvaise ‘bad road, badly-paved road’
Mauvaise route ‘wrong road’
• But is a reasonable first approximation
• Rule: Adjective Noun ⇒ Noun Adjective
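A structural transfer rule of this kind operates on a parse tree rather than on a flat word sequence. The sketch below assumes a parse is already available as nested tuples, a representation chosen here for illustration. It shows only the reordering step and omits lexical transfer (the word translations themselves).

```python
# A tree node is (label, child, child, ...); a leaf is (part_of_speech, word).

def is_leaf(tree):
    return len(tree) == 2 and isinstance(tree[1], str)


def transfer(tree):
    """Apply the rule ADJ NOUN => NOUN ADJ at every level of the tree."""
    if is_leaf(tree):
        return tree
    label, *children = tree
    children = [transfer(child) for child in children]
    reordered = []
    i = 0
    while i < len(children):
        if (i + 1 < len(children)
                and children[i][0] == "ADJ"
                and children[i + 1][0] == "NOUN"):
            reordered += [children[i + 1], children[i]]
            i += 2
        else:
            reordered.append(children[i])
            i += 1
    return (label, *reordered)


def leaves(tree):
    """Generation step for this sketch: read the words off the tree, left to right."""
    if is_leaf(tree):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]


english = ("S",
           ("NP", ("NOUN", "Mary")),
           ("VP", ("VERB", "slapped"),
                  ("NP", ("DET", "the"), ("ADJ", "green"), ("NOUN", "witch"))))

print(leaves(transfer(english)))
# ['Mary', 'slapped', 'the', 'witch', 'green']
```

Because the rule is stated over tree labels, it applies inside any noun phrase, however deeply nested.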
Transfer rules
Transferring the green witch….
Interlingua
• Instead of N² sets of transfer rules
• Use meaning as a representation language
1. Parse source sentence into meaning representation
2. Generate target sentence from meaning.
• Intuition: Use other NLP applications to do MT work
• English book to Spanish: libro or reservar
• Disambiguate book into concepts BOOKVOLUME and RESERVE
• Need 2N systems (a parser and generator for each
language)
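The saving is easy to check with a little arithmetic. The sketch below counts one transfer rule set per ordered language pair, N(N-1), which is an assumption made here and grows on the order of the N² quoted above. It counts one parser plus one generator per language for the interlingua.

```python
def transfer_rule_sets(n):
    """One rule set per ordered (source, target) pair of distinct languages."""
    return n * (n - 1)


def interlingua_components(n):
    """One parser and one generator per language."""
    return 2 * n


for n in (2, 3, 10, 20):
    print(n, transfer_rule_sets(n), interlingua_components(n))
# 2 2 4
# 3 6 6
# 10 90 20
# 20 380 40
```

Under this counting, the interlingua design only needs fewer components once more than three languages are involved.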
Interlingua for
Mary did not slap the green witch