Machine Translation CS 4705

Current Translation Systems
Senator John F. Kerry is casting himself as an
underdog in the fight of his political life as he
seeks to jump-start his presidential campaign
today with a major speech outlining the first 100
days of a Kerry presidency. He will promise health
care legislation as his first bill to Congress, new
limits on government lobbying, and a rebuke of
President Bush's military doctrine of preemptive
war.
Senador Juan F. Kerry es el bastidor mismo pues un
oprimido en la lucha de su vida política como él
busca saltar-empieza su campaña presidencial hoy
con un discurso importante que contornea los
primeros 100 días de una presidencia del kerry. Él
prometerá la legislación del cuidado médico como
su primera cuenta al congreso, nuevos límites en
el gobierno cabildeando, y una reprimenda de la
doctrina militar de presidente Bush de la guerra
con derecho preferente.
Senator Juan F. Kerry is the same frame because
pressing in the fight of its political life as he looks
for jump-begins its presidential campaign today
with an important speech that it skirts the first 100
days of a presidency of kerry. He will promise the
legislation of the medical care as his first account
to the congress, new limits in the government
lobbying, and a reprimand of the military doctrine
of president Bush of the war with preferred right.
Goals for MT
• Aids for human translators
– Online translating dictionaries
– Access to aligned examples
• Rough translation plus post-editing
• Full translation in
– Limited domains
– With additional resources such as previous
source/target document pairs
– Spoken dialogue systems
• Cross-language IR: translate the query terms and
search
• A related area: Language Identification
A Simple Approach to MT
Le petit chat ont mange le melon/The small cat ate
melon.
Les petits chats/The small cats
Le chat brun/The brown cat
Les petits pois/The peas
ont mange/ate
le melon/melon
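The phrase-pair lookup on this slide can be sketched as a greedy longest-match substitution. The table below uses only the pairs listed above; the function name `translate_by_lookup` and the choice to pass unknown words through unchanged are assumptions made for illustration.

```python
# Phrase table built from the pairs listed on the slide (lowercased).
PHRASE_TABLE = {
    ("les", "petits", "chats"): "the small cats",
    ("le", "chat", "brun"): "the brown cat",
    ("les", "petits", "pois"): "the peas",
    ("ont", "mange"): "ate",
    ("le", "melon"): "melon",
}


def translate_by_lookup(sentence, table=PHRASE_TABLE):
    """Greedy longest-match phrase substitution (hypothetical helper)."""
    words = sentence.lower().split()
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(words):
        # Try the longest phrase first so multi-word entries win.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in table:
                out.append(table[key])
                i += n
                break
        else:
            # Unknown word: pass it through unchanged (an assumption).
            out.append(words[i])
            i += 1
    return " ".join(out)


print(translate_by_lookup("Les petits chats ont mange le melon"))
```

With only these pairs, "Le petit chat" is not covered and passes through untranslated, which hints at why pure lookup does not scale beyond the stored examples.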
How do languages differ?
• How easy is it to segment text into words?
– English vs. Mandarin, Korean, Chinese
• How many morphemes per word?
– Turkish vs. Vietnamese
• How easy is it to segment words into morphemes?
– Russian vs. Turkish
• Word order
– SVO: English, Romance, Mandarin
– SOV: Japanese
– VSO: Arabic, Hebrew
– Free: Walpiri and (German, Hindi, Korean,???)
• Where are relationships marked? (on the head or on the non-head)
– (e.g. possessive) -- Hungarian vs. English
– prepositions vs. postpositions (English vs. Japanese)
• Does the language mark gender? Number?
Deference? Animacy? Kinship relations?
• Lexical gaps (It. Convento/Eng. Monastery,
Convent)
• Socio-cultural idiosyncrasies:
– dates (U.S. vs. Europe)
– numbers (French vs. English 80)
– telephone numbers
Sapir/Whorfianism
• Linguistic determinism: our language determines
our thought
• Linguistic relativism: those who speak different
languages perceive reality quite differently
• We dissect nature along lines laid down by our native languages. The
categories and types that we isolate from the world of phenomena we do not
find there because they stare every observer in the face; on the contrary, the
world is presented in a kaleidoscopic flux of impressions which has to be
organized by our minds - and this means largely by the linguistic systems in
our minds. We cut nature up, organize it into concepts, and ascribe
significances as we do, largely because we are parties to an agreement to
organize it in this way - an agreement that holds throughout our speech
community and is codified in the patterns of our language. The agreement is,
of course, an implicit and unstated one, but its terms are absolutely obligatory;
we cannot talk at all except by subscribing to the organization and
classification of data which the agreement decrees. (Whorf 1940, pp. 213-14)
Transfer Approaches to MT
• Require deep linguistic knowledge of specific
source and target
– Model differences: contrastive approach
• Analysis --> Transfer --> Generation
– Syntactic parse produces parse tree for source
– Syntactic transformation maps source tree to target tree
– Lexical transfer substitutes target leaves for source
leaves
– Handling agreement, particles, inflection
• NP Det Adj N {English} the brown cats
• NP Det N Adj {French} les chats bruns
Weaknesses of Transfer Approaches
• Requires transfer rules for each pair of languages
– Good syntactic parsers
– Good word-sense disambiguation
Interlingua Approaches
• Idea: extract meaning from source and represent in
language-independent form
– One set of transformation rules for each language to be
translated
– A well-chosen ontology:
• Capable of representing any meaning that any
language can convey
• Often, thematic role-based
• Semantic analysis --> semantic representation -->
target generation
– The cats are eating the melon/Les chats mangent le
melon.
EVENT eating
  AGENT cats
    NUMBER PL
    DEFINITENESS DEF
  PATIENT melon
    NUMBER SG
    DEFINITENESS DEF
  ASPECT PROGRESSIVE
  TENSE PRESENT
Drawbacks of Interlingua Approaches
• Ontology development is time-consuming
– Possible only for limited domains
– Requires full disambiguation even when current
source/target don’t make distinctions
– Adding another language may require major redesign
Direct Translation
• A ‘robust’ approach to translation: do what you
can without complex structures
• Resembles multi-stage transduction
Source: The cats ate the melon.
Morph: The cats/N+PL ate/V+Past the melon/N+SG.
Lexical Transfer: The chat PL manger Past+PL the melon SG
Generation: Les chats ont mangé le melon.
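The staged example above can be sketched as three small functions chained together. The lookup tables, the tag names, and the agreement rules are toy assumptions that cover only this one sentence; a real system would need far larger dictionaries and rule sets.

```python
# Stage 1 table: word -> (lemma, features). Toy entries for this sentence only.
MORPH = {
    "the": ("the", "Det"),
    "cats": ("cat", "N+PL"),
    "ate": ("eat", "V+Past"),
    "melon": ("melon", "N+SG"),
}
# Stage 2 table: source lemma -> target lemma.
LEXICON = {"cat": "chat", "eat": "manger", "melon": "melon"}
# Stage 3 table: past participles needed for the passé composé.
PAST_PARTICIPLE = {"manger": "mangé"}


def analyze(sentence):
    """Morph stage: tag each word with a lemma and features."""
    return [MORPH[word] for word in sentence.lower().rstrip(".").split()]


def transfer(analysis):
    """Lexical transfer stage: replace source lemmas, keep features."""
    return [(LEXICON.get(lemma, lemma), tag) for lemma, tag in analysis]


def generate(transferred):
    """Generation stage: choose determiners, inflect nouns, build the past tense."""
    words = []
    subject_plural = False
    seen_verb = False
    for i, (lemma, tag) in enumerate(transferred):
        if tag == "Det":
            # The determiner agrees in number with the noun that follows it.
            following = transferred[i + 1][1] if i + 1 < len(transferred) else ""
            words.append("les" if following == "N+PL" else "le")
        elif tag == "N+PL":
            words.append(lemma + "s")
            if not seen_verb:
                subject_plural = True
        elif tag == "N+SG":
            words.append(lemma)
        elif tag == "V+Past":
            seen_verb = True
            auxiliary = "ont" if subject_plural else "a"
            words.append(auxiliary + " " + PAST_PARTICIPLE[lemma])
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


def translate(sentence):
    return generate(transfer(analyze(sentence)))


print(translate("The cats ate the melon."))
```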
Statistical Approaches: The Noisy Channel Model Again
[Figure: noisy-channel diagram with components labeled Source, Noisy Channel, and Decoder]
– Input: the translation; Output: the source sentence
– Decoding:
\hat{T} = \arg\max_{T \in L_2} P(T \mid S)
        = \arg\max_{T \in L_2} \frac{P(S \mid T)\, P(T)}{P(S)}
        = \arg\max_{T \in L_2} P(S \mid T)\, P(T)
How do we develop such models?
• Ngram language models: to quantify fluency
• Parallel corpora, e.g. Hansards, to quantify
faithfulness
– Alignment problem: words and sentences
• Mr. Michel Guimond (Beauport-Montmorency-Orléans, BQ): moved that
Bill C-369, an act to amend the Criminal Code (gaming and betting), be read
the second time and referred to a committee.
• He said: Mr. Speaker, on March 12, I spoke before the Sub-committee on
Private Members' Business to introduce a private member's bill which would
have made it possible to open casinos on cruise ships sailing on the St.
Lawrence and the Great Lakes.
• This bill reflected, not some fantasy of the federal member for Beauport-Montmorency-Orléans, but a need expressed after long consultations with port
administrators, community organizations and municipalities along the St.
Lawrence. A number of municipal councils have even gone so far as to pass
resolutions in support of Bill C-369, not the least of these being Quebec City,
Beauport, in my riding, Charlesbourg and Ancienne-Lorette. I also consulted
with ship owners, organizations promoting navigation on the St. Lawrence,
and tourist associations.
• M. Michel Guimond (Beauport-Montmorency-Orléans, BQ) propose: Que
le projet de loi C-369, Loi modifiant le Code criminel (jeux et paris), soit
maintenant lu une deuxième fois et renvoyé à un comité.
• -Monsieur le Président, le 12 mars dernier, je me présentais devant le Sous-comité des affaires émanant des députés pour présenter un projet de loi
d'initiative privée qui aurait permis l'ouverture des casinos à bord des bateaux
de croisière qui naviguent sur le Saint-Laurent et les Grands Lacs.
• Ce projet de loi n'était pas une fantaisie du député fédéral de Beauport-Montmorency-Orléans, mais bien un besoin exprimé, après une longue
consultation, auprès des dirigeants des ports, des organismes du milieu et des
municipalités environnantes du Saint-Laurent. Plusieurs conseils municipaux
sont même allés jusqu'à adopter des résolutions appuyant le projet de loi C-369. Je n'en nomme que quelques-unes, et non les moindres: la Ville de
Québec, la Ville de Beauport, dans mon comté, la Ville de Charlesbourg et la
Ville de l'Ancienne-Lorette. Mes consultations se sont aussi orientées vers les
armateurs, les organismes promouvant la navigation sur le Saint-Laurent et les
associations touristiques.
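The ngram language model named at the top of this slide can be sketched as a bigram model that scores fluency. The toy corpus, the add-one smoothing, and the function names are assumptions chosen to keep the example short.

```python
from collections import Counter
import math


def train_bigram(sentences):
    """Count bigrams and their left contexts over a tokenized corpus."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log probability of a sentence under add-one smoothed bigrams."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


corpus = [
    "the cat ate the melon",
    "the brown cat ate",
    "the small cats ate the peas",
]
model = train_bigram(corpus)
fluent = log_prob("the brown cat ate the melon", model)
scrambled = log_prob("the cat brown ate melon the", model)
print(fluent > scrambled)
```

The fluent word order scores higher because more of its bigrams were seen in the corpus.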
Evaluating Translation Systems
• Human judgments: expensive
• Objective measures: may not accord with humans
• Solution: find objective measures that correlate
highly with human judgments
• IBM BLEU metric:
– Modified ngram precision: count the ngrams in the
candidate that occur in any reference translation,
clip each count at the maximum number of times that
ngram occurs in any single reference translation, sum
the clipped counts, and divide by the unclipped number
of candidate ngrams
Ref1: The cat is on the mat.
Ref2: There is a cat on the mat.
Cand1: the the the the the the the (2/7)
– Combining ngram scores
– Penalizing brief translations
• Bleu closely predicts human judgments
• Reference translations can be re-used for
subsequent system evaluation
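The modified ngram precision described on this slide can be sketched as follows. This covers only the clipped-precision step, not the combination of ngram orders or the brevity penalty; whitespace tokenization, lowercasing, and the function names are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous ngrams of length n, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Return (clipped matches, total candidate ngrams) for ngram order n."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    # For each ngram, the most times it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count at the reference maximum, then sum.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total


refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the the the the the the the", refs))
```

For the seven-word candidate, "the" occurs at most twice in a single reference, so the clipped count is 2 out of 7 candidate unigrams.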
Future Directions
• Phrase translation
• Example Based MT
• Serial translation, e.g. manuals
• Speech to Speech Translation
What is NLP?
• Syntax vs semantics vs discourse-pragmatics
• Knowledge-rich, linguistic vs robust, statistical
approaches
• Text vs speech oriented research
• Generation vs understanding