MT for Languages with Limited
Resources
11-731
Machine Translation
April 20, 2011
Based on Joint Work with: Lori Levin, Jaime
Carbonell, Stephan Vogel, Shuly Wintner, Danny
Shacham, Katharina Probst, Erik Peterson, Christian
Monson, Roberto Aranovich and Ariadna Font-Llitjos
Why Machine Translation
for Minority and Indigenous Languages?
• Commercial MT economically feasible for only
a handful of major languages with large
resources (corpora, human developers)
• Is there hope for MT for languages with very
limited resources?
• Benefits include:
– Better government access to indigenous
communities (Epidemics, crop failures, etc.)
– Better participation by indigenous communities in
information-rich activities (health care, education,
government) without giving up their languages
– Language preservation
– Civilian and military applications (disaster relief)
April 20, 2011
11-731 Machine Translation
2
MT for Minority and Indigenous
Languages: Challenges
• Minimal amount of parallel text
• Possibly competing standards for
orthography/spelling
• Often relatively few trained linguists
• Access to native informants possible
• Need to minimize development time
and cost
MT for Low Resource Languages
• Possible Approaches:
– Phrase-based SMT, with whatever small amount of
parallel data is available
– Build a rule-based system – need for bilingual
experts and resources
– Hybrid approaches, such as the AVENUE Project
(Stat-XFER) approach:
• Incorporate acquired manual resources within a
general statistical framework
• Augment with targeted elicitation and resource
acquisition from bilingual non-experts
CMU Statistical Transfer
(Stat-XFER) MT Approach
• Integrate the major strengths of rule-based and
statistical MT within a common framework:
– Linguistically rich formalism that can express complex and
abstract compositional transfer rules
– Rules can be written by human experts and also acquired
automatically from data
– Easy integration of morphological analyzers and
generators
– Word and syntactic-phrase correspondences can be
automatically acquired from parallel text
– Search-based decoding from statistical MT adapted to find
the best translation within the search space: multi-feature
scoring, beam-search, parameter optimization, etc.
– Framework suitable for both resource-rich and resource-poor language scenarios
Stat-XFER Main Principles
• Framework: Statistical search-based approach with
syntactic translation transfer rules that can be acquired
from data but also developed and extended by experts
• Automatic Word and Phrase translation lexicon
acquisition from parallel data
• Transfer-rule Learning: apply ML-based methods to
automatically acquire syntactic transfer rules for
translation between the two languages
• Elicitation: use bilingual native informants to produce a
small high-quality word-aligned bilingual corpus of
translated phrases and sentences
• Rule Refinement: refine the acquired rules via a process
of interaction with bilingual informants
• XFER + Decoder:
– XFER engine produces a lattice of possible transferred
structures at all levels
– Decoder searches and selects the best scoring combination
Stat-XFER Framework

[Architecture diagram]
Source Input → Preprocessing (Morphology) → Transfer Engine →
Translation Lattice → Second-Stage Decoder → Target Output
The Transfer Engine draws on the Transfer Rules and the Bilingual
Lexicon; the Second-Stage Decoder draws on the Language Model and
Weighted Features.
Hebrew Input
‫בשורה הבאה‬

Pipeline: Preprocessing / Morphology → Transfer Engine → Decoder
(with English Language Model) → English Output

Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))

Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1)
((X0 NUM) = s)
((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1)
((X0 NUM) = s)
((Y0 lex) = "LINE"))

Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

English Output
in the next line
Transfer Rule Formalism

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

Annotated parts of the rule: type information (NP::NP);
part-of-speech/constituent information ([DET ADJ N] -> [DET N DET ADJ]);
alignments ((X1::Y1), etc.); x-side constraints; y-side constraints;
xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
Transfer Rule Formalism (II)

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

Value constraints, e.g. ((X1 AGR) = *3-SING);
agreement constraints, e.g. ((Y2 GENDER) = (Y4 GENDER))
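The two kinds of constraints shown here can be sketched procedurally. This is a simplified illustration, not the Stat-XFER engine: feature structures are flat dicts, "=" (assign) is conflated with "=c" (check against an existing value), and the constraint tuples below are hand-transcribed from the rule above.

```python
# Simplified illustration of applying value and agreement constraints
# to feature structures (flat dicts here). It conflates "=" (assign)
# with "=c" (check); not the actual Stat-XFER unification machinery.

def apply_constraints(constraints, fs):
    """fs maps node names like 'X1' to feature dicts. Value constraints
    are (node, feat, value); agreement constraints are (n1, f1, n2, f2),
    copying/matching f2 of n2 onto f1 of n1. Returns False on conflict."""
    for c in constraints:
        if len(c) == 3:
            node, feat, value = c
            if fs[node].setdefault(feat, value) != value:
                return False
        else:
            n1, f1, n2, f2 = c
            if f2 in fs[n2]:
                v = fs[n2][f2]
                if fs[n1].setdefault(f1, v) != v:
                    return False
    return True

constraints = [
    ("X1", "AGR", "3-SING"),           # value constraints (x-side)
    ("X1", "DEF", "DEF"),
    ("X3", "AGR", "3-SING"),
    ("X3", "COUNT", "+"),
    ("Y2", "GENDER", "Y4", "GENDER"),  # agreement constraint
]
fs = {"X1": {}, "X3": {}, "Y2": {}, "Y4": {"GENDER": "M"}}
ok = apply_constraints(constraints, fs)
# ok is True, and Y2 has inherited GENDER = M from Y4
```

A failed value constraint (e.g. requiring AGR = 3-PLU on a node already set to 3-SING) makes the rule application fail, which is how constraints filter out bad analyses.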
Translation Lexicon:
Hebrew-to-English Examples
(Semi-manually-developed)
PRO::PRO |: ["ANI"] -> ["I"]
(
(X1::Y1)
((X0 per) = 1)
((X0 num) = s)
((X0 case) = nom)
)
N::N |: ["$&H"] -> ["HOUR"]
(
(X1::Y1)
((X0 NUM) = s)
((Y0 NUM) = s)
((Y0 lex) = "HOUR")
)
PRO::PRO |: ["ATH"] -> ["you"]
(
(X1::Y1)
((X0 per) = 2)
((X0 num) = s)
((X0 gen) = m)
((X0 case) = nom)
)
N::N |: ["$&H"] -> ["hours"]
(
(X1::Y1)
((Y0 NUM) = p)
((X0 NUM) = p)
((Y0 lex) = "HOUR")
)
Translation Lexicon:
French-to-English Examples
(Automatically-acquired)
DET::DET |: ["le"] -> ["the"]
(
(X1::Y1)
)
NP::NP |: ["le respect"] -> ["accordance"]
(
)
PP::PP |: ["dans le respect"] -> ["in accordance"]
(
)
Prep::Prep |: ["dans"] -> ["in"]
(
(X1::Y1)
)
N::N |: ["principes"] -> ["principles"]
(
(X1::Y1)
)
PP::PP |: ["des principes"] -> ["with the principles"]
(
)
N::N |: ["respect"] -> ["accordance"]
(
(X1::Y1)
)
Hebrew-English Transfer Grammar
Example Rules
(Manually-developed)
{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)
French-English Transfer Grammar
Example Rules
(Automatically-acquired)
{PP,24691}
;;SL: des principes
;;TL: with the principles
PP::PP ["des" N] -> ["with the" N]
(
(X1::Y1)
)

{PP,312}
;;SL: dans le respect des principes
;;TL: in accordance with the principles
PP::PP [Prep NP] -> [Prep NP]
(
(X1::Y1)
(X2::Y2)
)
The Transfer Engine
• Input: source-language input sentence, or source-language confusion network
• Output: lattice representing collection of translation
fragments at all levels supported by transfer rules
• Basic Algorithm: “bottom-up” integrated “parsing-transfer-generation” chart-parser guided by the
synchronous transfer rules
– Start with translations of individual words and phrases
from translation lexicon
– Create translations of larger constituents by applying
applicable transfer rules to previously created lattice
entries
– Beam-search controls the exponential combinatorics of the
search-space, using multiple scoring features
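The bottom-up combination step just described can be sketched in miniature. Everything below is a toy illustration, not the actual Stat-XFER engine or its rule format: lexicon entries are invented, rules are reduced to (category sequence, target category, reordering), and scores and constraints are omitted.

```python
# Toy illustration (NOT the actual Stat-XFER engine): a span-indexed
# chart is seeded with lexicon entries, then adjacent entries are
# combined by transfer rules until no new entries can be added.
from collections import defaultdict

# Toy translation lexicon: source token -> [(category, translation)]
LEXICON = {
    "B": [("PREP", "IN")],
    "H": [("DET", "THE")],
    "$WRH": [("N", "LINE")],
}

# Toy transfer rules: (source category pair, target category, target order)
RULES = [
    (("DET", "N"), "NP", (0, 1)),    # DET N -> DET N
    (("PREP", "NP"), "PP", (0, 1)),  # PREP NP -> PREP NP
]

def transfer_chart(words):
    """Map each span (i, j) to a set of (category, translation) readings."""
    chart = defaultdict(set)
    # seed with word- and phrase-level translations
    for i, w in enumerate(words):
        for cat, tgt in LEXICON.get(w, []):
            chart[(i, i + 1)].add((cat, tgt))
    changed = True
    while changed:  # combine until a fixed point is reached
        changed = False
        for (i, k), lefts in list(chart.items()):
            for (k2, j), rights in list(chart.items()):
                if k2 != k:
                    continue
                for lc, lt in list(lefts):
                    for rc, rt in list(rights):
                        for src, tgt_cat, order in RULES:
                            if src == (lc, rc):
                                parts = (lt, rt)
                                entry = (tgt_cat, " ".join(parts[x] for x in order))
                                if entry not in chart[(i, j)]:
                                    chart[(i, j)].add(entry)
                                    changed = True
    return chart

chart = transfer_chart(["B", "H", "$WRH"])
```

On the segmented Hebrew input B + H + $WRH, the full span receives a PP reading assembled from smaller constituents, mirroring how lattice entries at all levels are built.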
The Transfer Engine
• Some Unique Features:
– Works with either learned or manually-developed
transfer grammars
– Handles rules with or without unification constraints
– Supports interfacing with servers for morphological
analysis and generation
– Can handle ambiguous source-word analyses and/or
SL segmentations represented in the form of lattice
structures
Hebrew Example
(From [Lavie et al., 2004])
• Input word: B$WRH
0     1     2     3     4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|
Hebrew Example
(From [Lavie et al., 2004])
Y0: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y1: ((SPANSTART 0)
(SPANEND 2)
(LEX B)
(POS PREP))
Y2: ((SPANSTART 1)
(SPANEND 3)
(LEX $WR)
(POS N)
(GEN M)
(NUM S)
(STATUS ABSOLUTE))
Y3: ((SPANSTART 3)
(SPANEND 4)
(LEX $LH)
(POS POSS))
Y4: ((SPANSTART 0)
(SPANEND 1)
(LEX B)
(POS PREP))
Y5: ((SPANSTART 1)
(SPANEND 2)
(LEX H)
(POS DET))
Y6: ((SPANSTART 2)
(SPANEND 4)
(LEX $WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y7: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS LEX))
XFER Output Lattice
(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
(29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")
(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")
(29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")
(30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")
(30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")
(30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")
(30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")
(30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")
(30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")
(30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0
(N,34 'SLAVE')) ) ) ) ")
(30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0
(NP0,0 (N,36 'BONDSMAN')) ) ) ) ")
The Lattice Decoder
• Stack Decoder, similar to standard Statistical MT
decoders
• Searches for best-scoring path of non-overlapping
lattice arcs
• No reordering during decoding
• Scoring based on log-linear combination of scoring
features, with weights trained using Minimum Error Rate
Training (MERT)
• Scoring components:
– Statistical Language Model
– Bi-directional MLE phrase and rule scores
– Lexical Probabilities
– Fragmentation: how many arcs to cover the entire
translation?
– Length Penalty: how far from expected target length?
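The core of this search (a monotone best-path over non-overlapping lattice arcs with a per-arc fragmentation penalty) can be sketched as a small dynamic program. The arc scores and the fragmentation weight below are made-up illustrative values; the real decoder log-linearly combines all the features above with MERT-trained weights.

```python
# Toy illustration of the lattice decoder's search: a left-to-right
# dynamic program over non-overlapping arcs (no reordering). Scores
# and the fragmentation weight are invented for illustration.

def decode(arcs, n, frag_weight=-0.5):
    """arcs: (start, end, translation, model score). Returns the
    best-scoring full cover of positions 0..n."""
    best = {0: (0.0, [])}  # position -> (score, translation so far)
    for i in range(n):
        if i not in best:
            continue
        base_score, base_words = best[i]
        for s, e, words, score in arcs:
            if s != i:
                continue
            # each arc used is one "fragment"; penalize fragmentation
            cand = (base_score + score + frag_weight, base_words + [words])
            if e not in best or cand[0] > best[e][0]:
                best[e] = cand
    score, words = best[n]
    return score, " ".join(words)

arcs = [
    (0, 1, "IN", -1.0),
    (1, 2, "THE", -0.5),
    (2, 3, "LINE", -2.0),
    (0, 3, "IN THE LINE", -4.0),  # one arc from a phrase-level rule
]
score, translation = decode(arcs, 3)
```

Here the single three-position arc wins (-4.0 - 0.5 = -4.5) over the three-arc path (-3.5 - 1.5 = -5.0), showing how the fragmentation feature favors translations built from fewer, larger fragments.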
XFER Lattice Decoder
00
ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0,
Words: 13,13
235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE')
(NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0
(NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100
(NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A')
(NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>
Stat-XFER MT Systems
• General Stat-XFER framework under development for past
nine years
• Systems so far:
– Chinese-to-English
– French-to-English
– Hebrew-to-English
– Urdu-to-English
– German-to-English
– Hindi-to-English
– Dutch-to-English
– Turkish-to-English
– Mapudungun-to-Spanish
– Arabic-to-English
– Brazilian Portuguese-to-English
– English-to-Arabic
– Hebrew-to-Arabic
Learning Transfer-Rules for
Languages with Limited Resources
• Rationale:
– Large bilingual corpora not available
– Bilingual native informant(s) can translate and align a
small pre-designed elicitation corpus, using elicitation tool
– Elicitation corpus designed to be typologically
comprehensive and compositional
– Transfer-rule engine and rule learning approach support
acquisition of generalized transfer-rules from the data
English-Chinese Example
English-Hindi Example
Spanish-Mapudungun
Example
English-Arabic Example
The Typological Elicitation Corpus
• Translated, aligned by bilingual informant
• Corpus consists of linguistically diverse
constructions
• Based on elicitation and documentation work
of field linguists (e.g. Comrie 1977, Bouquiaux
1992)
• Organized compositionally: elicit simple
structures first, then use them as building
blocks
• Goal: minimize size, maximize linguistic
coverage
The Structural Elicitation Corpus
• Designed to cover the most common phrase
structures of English → learn how these
structures map onto their equivalents in other
languages
• Constructed using the constituent parse trees
from the Penn TreeBank
– Extracted and frequency ranked all rules in parse
trees
– Selected top ~200 rules, filtered idiosyncratic cases
– Revised lexical choices within examples
• Goal: minimize size, maximize linguistic
coverage of structures
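The extract-and-rank step above can be sketched as follows. The miniature bracketed trees stand in for Penn TreeBank parses, and the s-expression reader is written just for this sketch.

```python
# Sketch of the corpus-construction step: read bracketed parse trees,
# extract each phrasal CFG expansion, and frequency-rank the rules.
# The two miniature trees stand in for Penn TreeBank parses.
from collections import Counter

def parse_tree(s):
    """Parse '(NP (DET the) (N boy))' into nested (label, children) pairs."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def rec(i):
        label = tokens[i + 1]  # token after '('
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = rec(i)
            else:
                child, i = (tokens[i], []), i + 1  # leaf word
            children.append(child)
        return (label, children), i + 1
    tree, _ = rec(0)
    return tree

def count_rules(trees):
    """Count label -> child-label expansions, skipping preterminal -> word."""
    counts = Counter()
    def walk(node):
        label, children = node
        if children and all(c[1] for c in children):  # all children phrasal
            counts[(label, tuple(c[0] for c in children))] += 1
        for c in children:
            walk(c)
    for t in trees:
        walk(t)
    return counts

trees = [
    parse_tree("(S (NP (DET the) (N boy)) (VP (V ate) (NP (DET the) (N apple))))"),
    parse_tree("(PP (PREP in) (NP (DET the) (N forest)))"),
]
ranked = count_rules(trees).most_common()  # NP -> DET N comes out on top
```

With real TreeBank input, taking the top of this ranking (and filtering idiosyncratic cases by hand) yields the ~200 structures used for the corpus.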
The Structural Elicitation Corpus
Examples:
srcsent: in the forest
tgtsent: B H I&R
aligned: ((1,1),(2,2),(3,3))
context: C-Structure:(<PP> (PREP in-1) (<NP> (DET the-2) (N forest-3)))
srcsent: steps
tgtsent: MDRGWT
aligned: ((1,1))
context: C-Structure:(<NP> (N steps-1))
srcsent: the boy ate the apple
tgtsent: H ILD AKL AT H TPWX
aligned: ((1,1),(2,2),(3,3),(4,5),(5,6))
context: C-Structure:(<S> (<NP> (DET the-1) (N boy-2))
(<VP> (V ate-3) (<NP> (DET the-4)(N apple-5))))
srcsent: the first year
tgtsent: H $NH H RA$WNH
aligned: ((1,1 3),(2,4),(3,2))
context: C-Structure:(<NP> (DET the-1) (<ADJP> (ADJ first-2)) (N year-3))
A Limited Data Scenario for
Hindi-to-English
• Conducted during a DARPA “Surprise
Language Exercise” (SLE) in June 2003
• Put together a scenario with “miserly” data
resources:
– Elicited Data corpus: 17589 phrases
– Cleaned portion (top 12%) of LDC dictionary: ~2725
Hindi words (23612 translation pairs)
– Manually acquired resources during the SLE:
• 500 manual bigram translations
• 72 manually written phrase transfer rules
• 105 manually written postposition rules
• 48 manually written time expression rules
• No additional parallel text!!
Examples of Learned Rules
(Hindi-to-English)
{NP,14244}
;;Score:0.0429
NP::NP [N] -> [DET N]
(
(X1::Y2)
)
{NP,14434}
;;Score:0.0040
NP::NP [ADJ CONJ ADJ N] ->
[ADJ CONJ ADJ N]
(
(X1::Y1) (X2::Y2)
(X3::Y3) (X4::Y4)
)
{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
(
(X2::Y1)
(X1::Y2)
)
Manual Transfer Rules: Hindi Example
;; PASSIVE OF SIMPLE PAST (NO AUX) WITH LIGHT VERB
;; passive of 43 (7b)
{VP,28}
VP::VP : [V V V] -> [Aux V]
(
(X1::Y2)
((x1 form) = root)
((x2 type) =c light)
((x2 form) = part)
((x2 aspect) = perf)
((x3 lexwx) = 'jAnA')
((x3 form) = part)
((x3 aspect) = perf)
(x0 = x1)
((y1 lex) = be)
((y1 tense) = past)
((y1 agr num) = (x3 agr num))
((y1 agr pers) = (x3 agr pers))
((y2 form) = part)
)
Manual Transfer Rules: Example

; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
; ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
(X1::Y2)
(X2::Y1)
; ((x2 lexwx) = 'kA')
)

{NP,13}
NP::NP : [NP1] -> [NP1]
(
(X1::Y1)
)

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
(
(X1::Y2)
(X2::Y1)
)

[Tree diagram: the Hindi NP "jIvana ke eka aXyAya" (N1 ke Adj N) maps
to the English NP "one chapter of N1" (Adj N PP), with "jIvana"/"life"
as N1.]
Manual Grammar Development
• Covers mostly NPs, PPs and VPs (verb
complexes)
• ~70 grammar rules, covering basic and
recursive NPs and PPs, verb complexes
of main tenses in Hindi (developed in
two weeks)
Testing Conditions
• Tested on section of JHU provided data: 258
sentences with four reference translations
– SMT system (stand-alone)
– EBMT system (stand-alone)
– XFER system (naïve decoding)
– XFER system with “strong” decoder
• No grammar rules (baseline)
• Manually developed grammar rules
• Automatically learned grammar rules
– XFER+SMT with strong decoder (MEMT)
Results on JHU Test Set

System                            BLEU    M-BLEU   NIST
EBMT                              0.058   0.165    4.22
SMT                               0.093   0.191    4.64
XFER (naïve) man grammar          0.055   0.177    4.46
XFER (strong) no grammar          0.109   0.224    5.29
XFER (strong) learned grammar     0.116   0.231    5.37
XFER (strong) man grammar         0.135   0.243    5.59
XFER+SMT (strong) man grammar     0.136   0.243    5.65
Effect of Reordering in the
Decoder

[Chart: NIST score (y-axis, 4.8–5.7) vs. reordering window (x-axis,
0–4), with four curves: no grammar, learned grammar, manual grammar,
and MEMT: XFER+SMT.]
Observations and Lessons (I)
• XFER with strong decoder outperformed SMT
even without any grammar rules in the miserly
data scenario
– SMT Trained on elicited phrases that are very short
– SMT has insufficient data to train more discriminative
translation probabilities
– XFER takes advantage of Morphology
• Token coverage without morphology: 0.6989
• Token coverage with morphology: 0.7892
• Manual grammar was somewhat better than
automatically learned grammar
– Learned rules were very simple
– Large room for improvement on learning rules
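The token-coverage numbers above can be made concrete with a toy computation: the fraction of test tokens that find at least one lexicon entry, with and without first reducing each token to a base form. The lexicon and the one-rule "stemmer" below are illustrative assumptions, not the actual Hindi resources.

```python
# Toy illustration of token coverage: the fraction of test tokens that
# have at least one lexicon entry, with and without morphological
# reduction. Lexicon and "stemmer" are illustrative only.

LEXICON = {"the", "red", "book"}

def stem(token):
    # stand-in for real morphological analysis: strip a plural "s"
    return token[:-1] if token.endswith("s") and len(token) > 1 else token

def coverage(tokens, use_morphology):
    forms = [stem(t) if use_morphology else t for t in tokens]
    return sum(f in LEXICON for f in forms) / len(tokens)

tokens = ["the", "red", "books"]
# without morphology "books" is uncovered; with it, all tokens are covered
```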
Observations and Lessons (II)
• MEMT (XFER and SMT) based on strong decoder
produced best results in the miserly scenario.
• Reordering within the decoder provided very
significant score improvements
– Much room for more sophisticated grammar rules
– Strong decoder can carry some of the reordering
“burden”
Modern Hebrew
• Native language of about 3-4 Million in Israel
• Semitic language, closely related to Arabic and
with similar linguistic properties
– Root+Pattern word formation system
– Rich verb and noun morphology
– Particles attach as prefixes to the following word:
definite article (H), prepositions (B,K,L,M),
coordinating conjunction (W), relativizers ($,K$)…
• Unique alphabet and Writing System
– 22 letters represent (mostly) consonants
– Vowels represented (mostly) by diacritics
– Modern texts omit the diacritic vowels, thus an
additional level of ambiguity: “bare” word → word
– Example: MHGR → mehager, m+hagar, m+h+ger
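The MHGR ambiguity can be reproduced with a toy segmenter that proposes every split of a bare word into single-letter prefix particles plus a known stem. The stem list below is an illustrative assumption (HGR and GR stand in for hagar and ger), not a real Hebrew lexicon.

```python
# Toy illustration of "bare" word ambiguity: propose every split of an
# unvocalized word into single-letter prefix particles plus a known
# stem. The stem list is an assumption standing in for a real lexicon.

PREFIXES = {"H", "B", "K", "L", "M", "W"}  # article, prepositions, conj.
STEMS = {"MHGR", "HGR", "GR"}

def segmentations(word):
    """All splits: zero or more prefix letters followed by a known stem."""
    results = []
    def rec(rest, prefix_seq):
        if rest in STEMS:
            results.append(prefix_seq + [rest])
        if len(rest) > 1 and rest[0] in PREFIXES:
            rec(rest[1:], prefix_seq + [rest[0]])
    rec(word, [])
    return results

segs = segmentations("MHGR")
# three analyses, mirroring mehager / m+hagar / m+h+ger
```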
Modern Hebrew Spelling
• Two main spelling variants
– “KTIV XASER” (deficient): spelling with the vowel
diacritics; only the consonants remain when the
diacritics are removed
– “KTIV MALEH” (full): words with I/O/U vowels are
written with long vowels which include a letter
• KTIV MALEH is predominant, but not strictly
adhered to even in newspapers and official
publications → inconsistent spelling
• Example:
– niqud (spelling): NIQWD, NQWD, NQD
– When written as NQD, could also be niqed, naqed,
nuqad
Challenges for Hebrew MT
• Paucity of existing language resources for
Hebrew
– No publicly available broad coverage morphological
analyzer
– No publicly available bilingual lexicons or dictionaries
– No POS-tagged corpus or parse tree-bank corpus for
Hebrew
– No large Hebrew/English parallel corpus
• Scenario well suited for CMU transfer-based
MT framework for languages with limited
resources
Morphological Analyzer
• We use a publicly available morphological
analyzer distributed by the Technion’s
Knowledge Center, adapted for our system
• Coverage is reasonable (for nouns, verbs and
adjectives)
• Produces all analyses or a disambiguated
analysis for each word
• Output format includes lexeme (base form),
POS, morphological features
• Output was adapted to our representation
needs (POS and feature mappings)
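This output-adaptation step can be sketched as a small table-driven converter. The analyzer-side tag and feature names below are invented for illustration; the Technion analyzer's actual output format is not shown on the slide.

```python
# Sketch of the output-adaptation step: a table-driven mapping from
# (assumed) analyzer tag and feature names onto the engine's
# conventions. The analyzer-side names here are illustrative.

POS_MAP = {"noun": "N", "verb": "V", "adjective": "ADJ"}
FEAT_MAP = {"gender": "GEN", "number": "NUM", "construct": "STATUS"}
VALUE_MAP = {"masculine": "M", "feminine": "F", "singular": "S", "plural": "P"}

def adapt(analysis):
    """Convert one analyzer record into the engine's feature structure."""
    out = {"LEX": analysis["lexeme"], "POS": POS_MAP[analysis["pos"]]}
    for feat, val in analysis.get("features", {}).items():
        if feat in FEAT_MAP:  # drop features the engine does not use
            out[FEAT_MAP[feat]] = VALUE_MAP.get(val, val)
    return out

rec = {"lexeme": "$WRH", "pos": "noun",
       "features": {"gender": "feminine", "number": "singular"}}
# adapt(rec) yields {'LEX': '$WRH', 'POS': 'N', 'GEN': 'F', 'NUM': 'S'}
```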
Morphology Example
• Input word: B$WRH
0     1     2     3     4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|
Morphology Example
Y0: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y1: ((SPANSTART 0)
(SPANEND 2)
(LEX B)
(POS PREP))
Y2: ((SPANSTART 1)
(SPANEND 3)
(LEX $WR)
(POS N)
(GEN M)
(NUM S)
(STATUS ABSOLUTE))
Y3: ((SPANSTART 3)
(SPANEND 4)
(LEX $LH)
(POS POSS))
Y4: ((SPANSTART 0)
(SPANEND 1)
(LEX B)
(POS PREP))
Y5: ((SPANSTART 1)
(SPANEND 2)
(LEX H)
(POS DET))
Y6: ((SPANSTART 2)
(SPANEND 4)
(LEX $WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y7: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS LEX))
Translation Lexicon
• Constructed our own Hebrew-to-English lexicon, based
primarily on existing “Dahan” H-to-E and E-to-H
dictionary made available to us, augmented by other
public sources
• Coverage is not great but not bad as a start
– Dahan H-to-E is about 15K translation pairs
– Dahan E-to-H is about 7K translation pairs
• Base forms, POS information on both sides
• Converted Dahan into our representation, added entries
for missing closed-class entries (pronouns, prepositions,
etc.)
• Had to deal with spelling conventions
• Recently augmented with ~50K translation pairs
extracted from Wikipedia (mostly proper names and
named entities)
Manual Transfer Grammar
(human-developed)
• Initially developed by Alon in a couple of days,
extended and revised by Nurit over time
• Current grammar has 36 rules:
– 21 NP rules
– one PP rule
– 6 verb complexes and VP rules
– 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local)
structural differences between Hebrew and
English
Transfer Grammar
Example Rules
{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)
Hebrew-to-English MT Prototype
• Initial prototype developed within a two month
intensive effort
• Accomplished:
– Adapted available morphological analyzer
– Constructed a preliminary translation lexicon
– Translated and aligned Elicitation Corpus
– Learned XFER rules
– Developed (small) manual XFER grammar
– System debugging and development
– Evaluated performance on unseen test data using
automatic evaluation metrics
Example Translation
• Input:
– ‫לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה‬
– After debates many decided the government to hold
referendum in issue the withdrawal
• Output:
– AFTER MANY DEBATES THE GOVERNMENT DECIDED
TO HOLD A REFERENDUM ON THE ISSUE OF THE
WITHDRAWAL
Noun Phrases – Construct State
‫החלטת הנשיא הראשון‬
HXL@T [HNSIA HRA$WN]
decision.3SF-CS the-president.3SM the-first.3SM
THE DECISION OF THE FIRST PRESIDENT
‫החלטת הנשיא הראשונה‬
[HXL@T HNSIA] HRA$WNH
decision.3SF-CS the-president.3SM the-first.3SF
THE FIRST DECISION OF THE PRESIDENT
Noun Phrases - Possessives
‫הנשיא הכריז שהמשימה הראשונה שלו תהיה למצוא פתרון לסכסוך באזורנו‬
HNSIA HKRIZ $HM$IMH HRA$WNH $LW THIH
the-president announced that-the-task.3SF the-first.3SF of-him will.3SF
LMCWA PTRWN LSKSWK BAZWRNW
to-find solution to-the-conflict in-region-POSS.1P
Without transfer grammar:
THE PRESIDENT ANNOUNCED THAT THE TASK THE BEST OF HIM
WILL BE TO FIND SOLUTION TO THE CONFLICT IN REGION OUR
With transfer grammar:
THE PRESIDENT ANNOUNCED THAT HIS FIRST TASK WILL BE
TO FIND A SOLUTION TO THE CONFLICT IN OUR REGION
Subject-Verb Inversion
‫אתמול הודיעה הממשלה שתערכנה בחירות בחודש הבא‬
ATMWL HWDI&H HMM$LH
yesterday announced.3SF the-government.3SF
$T&RKNH BXIRWT BXWD$ HBA
that-will-be-held.3PF elections.3PF in-the-month the-next
Without transfer grammar:
YESTERDAY ANNOUNCED THE GOVERNMENT THAT WILL RESPECT
OF THE FREEDOM OF THE MONTH THE NEXT
With transfer grammar:
YESTERDAY THE GOVERNMENT ANNOUNCED THAT ELECTIONS
WILL ASSUME IN THE NEXT MONTH
Subject-Verb Inversion
‫לפני כמה שבועות הודיעה הנהלת המלון שהמלון יסגר בסוף השנה‬
LPNI KMH $BW&WT HWDI&H HNHLT HMLWN
before several weeks announced.3SF management.3SF.CS the-hotel
$HMLWN ISGR BSWF H$NH
that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year
Without transfer grammar:
IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE
HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR
With transfer grammar:
SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED
THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR
Evaluation Results
• Test set of 62 sentences from Haaretz
newspaper, 2 reference translations
System     BLEU     NIST     P        R        METEOR
No Gram    0.0616   3.4109   0.4090   0.4427   0.3298
Learned    0.0774   3.5451   0.4189   0.4488   0.3478
Manual     0.1026   3.7789   0.4334   0.4474   0.3617
Current and Future Work
• Issues specific to the Hebrew-to-English system:
– Coverage: further improvements in the translation lexicon
and morphological analyzer
– Manual Grammar development
– Acquiring/training of word-to-word translation probabilities
– Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
• General Issues related to XFER framework:
– Discriminative Language Modeling for MT
– Effective models for assigning scores to transfer rules
– Improved grammar learning
– Merging/integration of manual and acquired grammars
Conclusions
• Test case for the CMU XFER framework for
rapid MT prototyping
• Preliminary system was a two-month, three
person effort – we were quite happy with the
outcome
• Core concept of XFER + Decoding is very
powerful and promising for low-resource MT
• We experienced the main bottlenecks of
knowledge acquisition for MT: morphology,
translation lexicons, grammar...
Mapudungun-to-Spanish
Example
English
I didn’t see Maria
Mapudungun
pelafiñ Maria
Spanish
No vi a María
Mapudungun-to-Spanish
Example
English
I didn’t see Maria
Mapudungun
pelafiñ Maria
pe  -la  -fi    -ñ                    Maria
see -neg -3.obj -1.subj.indicative    Maria

Spanish
No vi a María
No  vi                            a    María
neg see.1.subj.past.indicative    acc  Maria
pe-la-fi-ñ Maria

[Tree-building animation, condensed. The source tree is built
bottom-up, one node at a time:]
• V: the verb root pe
• VSuff: suffix la attaches (Negation = +)
• VSuffG: a verb-suffix group is formed over la; pass all features up
• VSuff: suffix fi attaches (object person = 3)
• VSuffG: the group combines with fi; pass all features up from both
children
• VSuff: suffix ñ attaches (person = 1, number = sg, mood = ind)
• VSuffG: the group combines with ñ; pass all features up from both
children
• V: V and VSuffG combine into the inflected verb; pass all features
up from both children, and check that: 1) negation = +, 2) tense is
undefined
• NP: Maria is analyzed as N, then NP (person = 3, number = sg,
human = +)
• S: NP and V combine into VP and S; pass features up from V, and
check that the NP is human = +

Transfer to Spanish: Top-Down

[The Spanish tree is built top-down from the source tree:]
• S and VP are created on the Spanish side; all features are passed
to the Spanish side, then down through the tree
• The accusative marker “a” is introduced on the object NP because
human = +, by the rule:

VP::VP [VBar NP] -> [VBar "a" NP]
(
(X1::Y1)
(X2::Y3)
((X2 type) = (*NOT* personal))
((X2 human) =c +)
(X0 = X1)
((X0 object) = X2)
(Y0 = X0)
((Y0 object) = (X0 object))
(Y1 = Y0)
(Y3 = (Y0 object))
((Y1 objmarker person) = (Y3 person))
((Y1 objmarker number) = (Y3 number))
((Y1 objmarker gender) = (Y3 gender))
)

• Object features are passed down; person, number and mood features
pass to the Spanish verb, and tense = past is assigned
• “no” is introduced before the verb because negation = +
• The verb ver is generated and inflected as vi (person = 1,
number = sg, mood = indicative, tense = past)
• Features pass over to the Spanish side of the object NP, generating
N María

I Didn’t See Maria

Final result: pe-la-fi-ñ Maria → No vi a María