Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System

Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Joint work with:
Shuly Wintner, Danny Shacham, Nurit Melnik, Yuval Krymolowski – University of Haifa
Erik Peterson – Carnegie Mellon University
Outline
• Context of this Work
• CMU Statistical Transfer MT Framework
• Hebrew and its Challenges for MT
• Hebrew-to-English System
• Morphological Analysis and Generation
• MT Resources: lexicon and grammar
• Translation Examples
• Performance Evaluation
• Conclusions, Current and Future Work
June 20, 2007
ISCOL/BISFAI-2007
Current State-of-the-art in
Machine Translation
• MT underwent a major paradigm shift over the past 15
years:
– From manually crafted rule-based systems with manually
designed knowledge resources
– To search-based approaches founded on automatic
extraction of translation models/units from large
sentence-parallel corpora
• Current Dominant Approach: Phrase-based Statistical
MT:
– Extract and statistically model large volumes of
phrase-to-phrase correspondences from automatically
word-aligned parallel corpora
– “Decode” new input by searching for the most likely
sequence of phrase matches, using a statistical Language
Model for the target language
Current State-of-the-art in
Machine Translation
• Phrase-based MT State-of-the-art:
– Requires at least several million words of parallel
text for adequate training
– Limited to language-pairs for which such data exists:
major European languages, Chinese, Japanese, a few
others…
– Linguistically shallow and highly lexicalized models
result in weak generalization
– Best performance levels (BLEU=~0.6) on
Arabic-to-English provide understandable but often still
somewhat disfluent translations
– Ill suited for Hebrew and most of the world’s minor
languages
CMU’s Statistical-Transfer
(XFER) Approach
• Framework: Statistical search-based approach with
syntactic translation transfer rules that can be acquired
from data but also developed and extended by experts
• Elicitation: use bilingual native informants to produce a
small high-quality word-aligned bilingual corpus of
translated phrases and sentences
• Transfer-rule Learning: apply ML-based methods to
automatically acquire syntactic transfer rules for
translation between the two languages
• XFER + Decoder:
– XFER engine produces a lattice of possible transferred
structures at all levels
– Decoder searches and selects the best scoring combination
• Rule Refinement: refine the acquired rules via a process
of interaction with bilingual informants
• Word and Phrase bilingual lexicon acquisition
Hebrew Input
‫בשורה הבאה‬
Pipeline: Preprocessing -> Morphology -> Transfer Engine -> Decoder
Resources: Transfer Rules and Translation Lexicon (Transfer Engine), English Language Model (Decoder)
Transfer Rules
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))
Translation Lexicon
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1)
((X0 NUM) = s)
((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1)
((X0 NUM) = s)
((Y0 lex) = "LINE"))
Translation Output Lattice
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)
English Output
in the next line
Transfer Rule Formalism
;SL: the old man, TL: ha-ish ha-zaqen
; Type information: NP::NP
; Part-of-speech/constituent information: [DET ADJ N] -> [DET N DET ADJ]
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
; Alignments
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
; x-side constraints
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
; y-side constraints
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
; xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
)
The Transfer Engine
• Main algorithm: chart-style bottom-up integrated
parsing+transfer with beam pruning
– Seeded by word-to-word translations
– Driven by transfer rules
– Generates a lattice of transferred translation segments at
all levels
• Some Unique Features:
– Works with either learned or manually-developed transfer
grammars
– Handles rules with or without unification constraints
– Supports interfacing with servers for morphological
analysis and generation
– Can handle ambiguous source-word analyses and/or SL
segmentations represented in the form of lattice structures
XFER Output Lattice
(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
(29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")
(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")
(29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")
(30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")
(30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")
(30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")
(30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")
(30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")
(30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")
(30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")
(30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")
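Each lattice entry appears to follow the layout (start end "text" score "source" "tree"). The Python sketch below reads one such entry into a dictionary. The field names and the regular expression are assumptions inferred from the listing, not a documented XFER format.

```python
import re

# Pattern for one arc: (start end "text" score "source" "tree").
# The field layout is inferred from the lattice listing, not from a format spec.
ARC_RE = re.compile(
    r'\((\d+)\s+(\d+)\s+"([^"]*)"\s+(-?\d+(?:\.\d+)?)\s+"([^"]*)"\s+"([^"]*)"\)'
)


def parse_arc(line):
    """Parse one lattice line into a dict, or return None if it does not match."""
    match = ARC_RE.fullmatch(line.strip())
    if match is None:
        return None
    start, end, text, score, source, tree = match.groups()
    return {
        "start": int(start),
        "end": int(end),
        "text": text,            # English translation fragment
        "score": float(score),   # log-score assigned to the arc
        "source": source.strip(),  # transliterated Hebrew source segment
        "tree": tree.strip(),    # derivation tree for the fragment
    }


arc = parse_arc(
    '(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 \'SINCE THEN\')) ")'
)
print(arc["text"], arc["score"])
```

Entries that wrap across lines would need to be joined before parsing.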
The Lattice Decoder
• Simple Stack Decoder, similar in principle to simple
Statistical MT decoders
• Searches for best-scoring path of non-overlapping
lattice arcs
• No reordering during decoding
• Scoring based on log-linear combination of scoring
components, with weights trained using MERT
• Scoring components:
– Statistical Language Model
– Fragmentation: how many arcs to cover the entire
translation?
– Length Penalty
– Rule Scores
– Lexical Probabilities (not fully integrated)
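The search for a best-scoring path of non-overlapping arcs can be illustrated with a small Python sketch. It is not the XFER stack decoder: it uses plain dynamic programming, omits the language model, treats fragmentation as a fixed penalty per arc, and assumes half-open spans. The `Arc` type, the `frag_weight` parameter, and all scores in the example are hypothetical.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    start: int    # first source position covered (inclusive)
    end: int      # position just past the arc (exclusive), so end > start
    text: str     # English translation fragment
    score: float  # log-score of the arc (higher is better)


def decode(arcs, n, frag_weight=1.0):
    """Best monotone cover of source positions [0, n) by non-overlapping arcs.

    Returns (score, fragments), or None if the arcs cannot cover the input.
    Each arc used costs frag_weight, so paths with fewer arcs are preferred.
    """
    best = {0: (0.0, [])}
    for pos in range(n):
        if pos not in best:
            continue  # no partial path ends at this position
        base_score, base_path = best[pos]
        for arc in arcs:
            if arc.start != pos:
                continue
            candidate = base_score + arc.score - frag_weight
            if arc.end not in best or candidate > best[arc.end][0]:
                best[arc.end] = (candidate, base_path + [arc.text])
    return best.get(n)


# Hypothetical arcs and scores for a four-position input.
arcs = [
    Arc(0, 1, "IN", -2.0),
    Arc(1, 2, "THE", -1.5),
    Arc(2, 3, "NEXT", -3.0),
    Arc(3, 4, "LINE", -3.0),
    Arc(1, 4, "THE NEXT LINE", -6.0),
    Arc(0, 4, "IN THE NEXT LINE", -8.5),
]
with_penalty = decode(arcs, 4, frag_weight=1.0)
without_penalty = decode(arcs, 4, frag_weight=0.0)
print(with_penalty)
print(without_penalty)
```

With the penalty the single full-span arc wins. Without it, the two-arc path "IN" + "THE NEXT LINE" scores higher, which shows why a fragmentation component matters.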
XFER Lattice Decoder
00
ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0,
Words: 13,13
235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE')
(NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0
(NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100
(NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A')
(NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>
XFER MT Prototypes
• General XFER framework under development for the past
five years
• Prototype systems so far:
– German-to-English
– Dutch-to-English
– Chinese-to-English
– Hindi-to-English
– Hebrew-to-English
• In progress or planned:
– Mapudungun-to-Spanish
– Quechua-to-Spanish
– Brazilian Portuguese-to-English
– Native-Brazilian languages to Brazilian Portuguese
– Hebrew-to-Arabic
Challenges for Hebrew MT
• Paucity of existing language resources for
Hebrew
– No publicly available broad coverage morphological
analyzer
– No publicly available bilingual lexicons or dictionaries
– No POS-tagged corpus or parse tree-bank corpus for
Hebrew
– No large Hebrew/English parallel corpus
• Scenario well suited for CMU transfer-based
MT framework for languages with limited
resources
Modern Hebrew Spelling
• Two main spelling variants
– “KTIV XASER” (deficient): spelling that relies on vowel
diacritics; only the consonants remain when the diacritics
are removed
– “KTIV MALEH” (full): words with I/O/U vowels are
written with an explicit vowel letter
• KTIV MALEH is predominant, but not strictly
adhered to even in newspapers and official
publications → inconsistent spelling
• Example:
– niqud (spelling): NIQWD, NQWD, NQD
– When written as NQD, could also be niqed, naqed,
nuqad
Morphological Analyzer
• We use a publicly available morphological
analyzer distributed by the Technion’s
Knowledge Center, adapted for our system
• Coverage is reasonable (for nouns, verbs and
adjectives)
• Produces all analyses or a disambiguated
analysis for each word
• Output format includes lexeme (base form),
POS, morphological features
• Output was adapted to our representation
needs (POS and feature mappings)
Morphology Example
• Input word: B$WRH
0 1 2 3 4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|
Morphology Example
Y0: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y1: ((SPANSTART 0)
(SPANEND 2)
(LEX B)
(POS PREP))
Y2: ((SPANSTART 1)
(SPANEND 3)
(LEX $WR)
(POS N)
(GEN M)
(NUM S)
(STATUS ABSOLUTE))
Y3: ((SPANSTART 3)
(SPANEND 4)
(LEX $LH)
(POS POSS))
Y4: ((SPANSTART 0)
(SPANEND 1)
(LEX B)
(POS PREP))
Y5: ((SPANSTART 1)
(SPANEND 2)
(LEX H)
(POS DET))
Y6: ((SPANSTART 2)
(SPANEND 4)
(LEX $WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y7: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS LEX))
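The analyses above are flat lists of (FEATURE VALUE) pairs, so they are straightforward to load into a program. The Python sketch below reads one analysis into a dictionary. The function name and the regular expression are illustrative assumptions and are not part of the analyzer or the XFER engine.

```python
import re

# One (FEATURE VALUE) pair; values may contain symbols such as "$".
PAIR_RE = re.compile(r"\(([A-Z]+)\s+([^\s()]+)\)")


def parse_analysis(text):
    """Turn '((SPANSTART 0) (LEX B) ...)' into a feature dictionary."""
    features = {}
    for name, value in PAIR_RE.findall(text):
        # Span positions are integers; every other value stays a string.
        features[name] = int(value) if value.isdigit() else value
    return features


y2 = parse_analysis(
    "((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) "
    "(GEN M) (NUM S) (STATUS ABSOLUTE))"
)
print(y2["LEX"], y2["POS"], y2["SPANSTART"], y2["SPANEND"])
```

Keeping the span positions as integers makes it easy to group analyses that cover the same part of the input word.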
Translation Lexicon
• Constructed our own Hebrew-to-English lexicon, based
primarily on existing “Dahan” H-to-E and E-to-H
dictionary made available to us, augmented by other
public sources
• Coverage is not great but not bad as a start
– Dahan H-to-E is about 15K translation pairs
– Dahan E-to-H is about 7K translation pairs
• Base forms, POS information on both sides
• Converted Dahan into our representation, added entries
for missing closed-class entries (pronouns, prepositions,
etc.)
• Had to deal with spelling conventions
• Recently augmented with ~50K translation pairs
extracted from Wikipedia (mostly proper names and
named entities)
Manual Transfer Grammar
(human-developed)
• Initially developed by Alon in a couple of days,
extended and revised by Nurit over time
• Current grammar has 36 rules:
– 21 NP rules
– one PP rule
– 6 verb complexes and VP rules
– 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local)
structural differences between Hebrew and
English
Transfer Grammar
Example Rules
{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)
{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)
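To make the effect of such a rule concrete, here is a minimal Python sketch of applying the indefinite rule {NP1,2} to a noun phrase and an adjective. It is not the XFER engine. It assumes that "=" unifies (succeeding when the feature is absent) and that "=c" only checks an existing value; the dictionary layout and the "en" key holding the English translation are hypothetical.

```python
def unify(fs, feat, value):
    """'=' (assumed semantics): succeed if feat is absent or equal; set it."""
    if feat in fs and fs[feat] != value:
        return False
    fs[feat] = value
    return True


def check(fs, feat, value):
    """'=c' (assumed semantics): feat must already be present and equal."""
    return fs.get(feat) == value


def agree(a, b, feat):
    """Agreement between two constituents; a missing value does not clash."""
    return feat not in a or feat not in b or a[feat] == b[feat]


def apply_np1_2(np1, adj):
    """Sketch of NP1::NP1 [NP1 ADJ] -> [ADJ NP1]; returns None on failure."""
    x1 = dict(np1)  # copy so a failed rule leaves the input untouched
    if not unify(x1, "def", "-"):
        return None
    if not check(x1, "status", "absolute"):
        return None
    if not (agree(x1, adj, "num") and agree(x1, adj, "gen")):
        return None
    x0 = dict(x1)                            # (X0 = X1)
    x0["en"] = adj["en"] + " " + x1["en"]    # target side: [ADJ NP1]
    return x0


dress = {"en": "DRESS", "num": "s", "gen": "f", "status": "absolute"}
red = {"en": "RED", "num": "s", "gen": "f"}
result = apply_np1_2(dress, red)
print(result["en"])
```

The rule fails when the adjective disagrees in gender or number, or when the noun phrase is already marked definite. The definite case is what the {NP1,3} rule handles.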
Hebrew-to-English MT Prototype
• Initial prototype developed within a two-month
intensive effort
• Accomplished:
– Adapted available morphological analyzer
– Constructed a preliminary translation lexicon
– Translated and aligned Elicitation Corpus
– Learned XFER rules
– Developed (small) manual XFER grammar
– System debugging and development
– Evaluated performance on unseen test data using
automatic evaluation metrics
Example Translation
• Input:
– ‫לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה‬
– After debates many decided the government to hold
referendum in issue the withdrawal
• Output:
– AFTER MANY DEBATES THE GOVERNMENT DECIDED
TO HOLD A REFERENDUM ON THE ISSUE OF THE
WITHDRAWAL
Noun Phrases – Construct State
‫החלטת הנשיא הראשון‬
HXL@T            [HNSIA              HRA$WN]
decision.3SF-CS  the-president.3SM   the-first.3SM
THE DECISION OF THE FIRST PRESIDENT
‫החלטת הנשיא הראשונה‬
[HXL@T           HNSIA]              HRA$WNH
decision.3SF-CS  the-president.3SM   the-first.3SF
THE FIRST DECISION OF THE PRESIDENT
Noun Phrases - Possessives
‫הנשיא הכריז שהמשימה הראשונה שלו תהיה למצוא פתרון לסכסוך באזורנו‬
HNSIA          HKRIZ      $HM$IMH            HRA$WNH        $LW     THIH
the-president  announced  that-the-task.3SF  the-first.3SF  of-him  will.3SF
LMCWA    PTRWN     LSKSWK           BAZWRNW
to-find  solution  to-the-conflict  in-region-POSS.1P
Without transfer grammar:
THE PRESIDENT ANNOUNCED THAT THE TASK THE BEST OF HIM
WILL BE TO FIND SOLUTION TO THE CONFLICT IN REGION OUR
With transfer grammar:
THE PRESIDENT ANNOUNCED THAT HIS FIRST TASK WILL BE
TO FIND A SOLUTION TO THE CONFLICT IN OUR REGION
Subject-Verb Inversion
‫אתמול הודיעה הממשלה שתערכנה בחירות בחודש הבא‬
ATMWL      HWDI&H         HMM$LH
yesterday  announced.3SF  the-government.3SF
$T&RKNH                BXIRWT         BXWD$         HBA
that-will-be-held.3PF  elections.3PF  in-the-month  the-next
Without transfer grammar:
YESTERDAY ANNOUNCED THE GOVERNMENT THAT WILL RESPECT
OF THE FREEDOM OF THE MONTH THE NEXT
With transfer grammar:
YESTERDAY THE GOVERNMENT ANNOUNCED THAT ELECTIONS
WILL ASSUME IN THE NEXT MONTH
Subject-Verb Inversion
‫לפני כמה שבועות הודיעה הנהלת המלון שהמלון יסגר בסוף השנה‬
LPNI    KMH      $BW&WT  HWDI&H         HNHLT              HMLWN
before  several  weeks   announced.3SF  management.3SF.CS  the-hotel
$HMLWN              ISGR                BSWF           H$NH
that-the-hotel.3SM  will-be-closed.3SM  at-end.3SM.CS  the-year
Without transfer grammar:
IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE
HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR
With transfer grammar:
SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED
THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR
Evaluation Results
• Test set of 62 sentences from Haaretz
newspaper, 2 reference translations
System      BLEU    NIST    Precision  Recall  METEOR
No Grammar  0.0616  3.4109  0.4090     0.4427  0.3298
Learned     0.0774  3.5451  0.4189     0.4488  0.3478
Manual      0.1026  3.7789  0.4334     0.4474  0.3617
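The BLEU scores are easier to compare as relative gains over the no-grammar baseline. The Python sketch below computes them from the reported figures; the helper name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(score, baseline):
    """Percentage improvement of score over baseline."""
    return 100.0 * (score - baseline) / baseline


# BLEU scores as reported in the results table.
bleu = {"No Gram": 0.0616, "Learned": 0.0774, "Manual": 0.1026}

gains = {name: relative_gain(score, bleu["No Gram"]) for name, score in bleu.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}% BLEU relative to the no-grammar baseline")
```

On these figures the learned grammar is roughly 26% above the baseline in BLEU and the manual grammar roughly 67% above it.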
Current and Future Work
• Issues specific to the Hebrew-to-English system:
– Coverage: further improvements in the translation lexicon
and morphological analyzer
– Manual Grammar development
– Acquiring/training of word-to-word translation probabilities
– Acquiring/training of a Hebrew language model at a
post-morphology level that can help with disambiguation
• General Issues related to XFER framework:
– Discriminative Language Modeling for MT
– Effective models for assigning scores to transfer rules
– Improved grammar learning
– Merging/integration of manual and acquired grammars
Conclusions
• Test case for the CMU XFER framework for
rapid MT prototyping
• Preliminary system was a two-month, three-person
effort – we were quite happy with the
outcome
• Core concept of XFER + Decoding is very
powerful and promising for MT
• We experienced the main bottlenecks of
knowledge acquisition for MT: morphology,
translation lexicons, grammar...
Questions?