Dependency Trees and
Machine Translation
Vamshi Ambati
Vamshi@cs.cmu.edu
Spring 2008 Adv MT Seminar
02 April 2008
Today
• Introduction
– Dependency formalism
– Syntax in Machine Translation
• Dependency Tree based Machine
Translation
– By projection
– By synchronous modeling
• Conclusion and Future
Dependency Trees
Phrase Structure Trees
[Figure: phrase structure tree for the example sentence "John gave Mary an apple"]
Dependency Trees
Phrase Structure Trees: Labels
[Figure: labeled phrase structure tree, S → NP VP, with POS tags John:N gave:V Mary:N an:DT apple:N]
Dependency Trees
Head Percolation:
- Usually done deterministically
- Assuming one head per phrase*
[Figure: head words percolating up the phrase structure tree for "John gave Mary an apple": "gave" heads the VP and S; "apple" heads the NP "an apple"]
Dependency Trees
[Figure: dependency tree rooted at "gave", with dependents "John", "Mary", and "apple"; "an" depends on "apple"]
Dependency Trees
[Figure: dependency arcs drawn over the flat sentence "John gave Mary an apple"]
Dependency Trees: Basics
[Figure: labeled arc John →SUBJ gave]
• The dependent ("John") is variously called the child, dependent, or modifier
• The head ("gave") is variously called the parent, governor, head, or modified word
• Arc labels (e.g., SUBJ) are optional
• The direction of arrows can be head-to-child or child-to-head (has to be stated)
Dependency Trees: Basics
• Properties
– Every word has a single head/parent
• Except for the root
– Completely connected tree
– Acyclic
• If wi→wj then never wj→*wi
• Variants
– Projective: no crossings between dependencies
• If wi→wj, then every wk between i and j is dominated by wi (wi→*wk); see the sketch below
– Non-Projective: crossings between dependencies are allowed
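A minimal sketch of the projectivity check in Python, assuming the tree is encoded as a head array (heads[i] is the parent index of word i, -1 for the root); the encoding is an assumption for illustration:

```python
def is_projective(heads):
    """Check projectivity: for every arc (h, d), every word strictly
    between h and d must be a descendant of h."""
    def dominates(h, k):
        # Follow parent pointers from k up to the root.
        while k != -1:
            if k == h:
                return True
            k = heads[k]
        return False

    for d in range(len(heads)):
        h = heads[d]
        if h == -1:
            continue  # root has no incoming arc
        lo, hi = min(h, d), max(h, d)
        if not all(dominates(h, k) for k in range(lo + 1, hi)):
            return False
    return True

# "John gave Mary an apple": gave is the root; John, Mary, apple attach
# to gave; an attaches to apple.
heads = [1, -1, 1, 4, 1]
print(is_projective(heads))  # True
```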
Projective dependency tree
[Figure: a projective dependency tree over an example sentence]
Projectiveness: all the words in between ultimately depend on either "was" or "."
Example credit: Yuji Matsumoto, NAIST, Japan
Non-projective dependency tree
[Figure: a non-projective dependency tree with crossing edges]
Direction of edges: from a parent to its children
Note: phrases extracted as dependency-connected units can thus be discontinuous
Example from: R. McDonald and F. Pereira, EACL 2006
Dependency Grammar (DG) in the Grammar Formalism Timeline
• Panini (2600 years ago, India) recognised, distinguished and classified semantic, syntactic and morphological dependencies (Bharati, Natural Language Processing)
• The Arabic grammarians (1200 years ago, Iraq) recognised government and syntactic dependency structure (The Foundations of Grammar, Owens)
• The Latin grammarians (800 years ago) recognised 'determination' and dependency structures (Percival, "Reflections on the History of Dependency Notions")
• Lucien Tesnière (1930s, France) developed a relatively formal and sophisticated theory of DG for use in schools
• PSG, CCG, etc. emerged around the same time, in the early 20th century
Source: ESSLLI 2000 Tutorial on Dependency Grammars
Dependency Trees: some phenomena
• DG has been widely accepted as a variant of PSG, but the two are not strongly equivalent
– Constituents are implicit in a dependency tree and can be derived
– Relations are explicit and can be labelled, although labels are optional
– No explicit non-terminal nodes, which also means no unary productions
– Can handle discontinuous phrases too
• Known problems with coordination and gerunds
Phrase structure vs. Dependency
• Phrase structure is suited to languages with
– rather fixed word order patterns
– clear constituency structures
– e.g., English
• Dependency structure is suited to languages with
– greater freedom of word order
– order controlled more by pragmatic than by syntactic factors
– e.g., Slavonic (Czech, Polish) and some Romance (Italian, Spanish) languages
Today
• Introduction
– Dependency formalism
– Syntax in Machine Translation
• Dependency Tree based Machine
Translation
– By projection
– By synchronous modeling
• Conclusion and Future
Phrasal SMT discussion
• Advantages:
– Do not have to compose translations unnecessarily
– Local re-ordering captured in phrases
– Already specific to the domain and capture context locally
• Disadvantages:
– Specificity and no generalization
– Discontiguous phrases not considered
– Global reordering
– Estimation problems (long vs. short phrases)
– Cannot model phenomena across phrases
• Limitations:
– Phrase sizes (how large before I run out of memory?)
– Corpus availability makes it feasible only for certain language pairs
Syntax in MT: Many Representations
• Word-level MT: no syntax
• SMT: phrases / contiguous sequences
• Hierarchical SMT: pseudo-syntax
• Syntax-based SMT: constituent
• Syntax-based SMT: CCG
• Syntax-based SMT: LFG
• Syntax-based SMT: dependency
Syntax in MT: Many ways of incorporation
• Pre-processing
– Reordering input
– Reordered training corpus
• Translation models
– Syntactically informed alignment models
– Better distortion models
• Language Models
– Syntactic language models
– Syntax-motivated models
• Post-processing
– N-best list reranking with syntactic information
– Translation correction: case marker/TAM correction
– True casing etc.?
– Multi-engine combinations with syntactic backbones?
Syntax based SMT discussion
• Inversion Transduction Grammar (Wu '96)
– Very constrained form of syntax: one non-terminal
– Some expressive limitations
– Not linguistically motivated
– Effectively learns preferences for flip/no-flip
• Generative Tree to String (Yamada & Knight 2001)
– Expressiveness (last week's presentation)
– No discontiguous phrases
• Multitext grammars (Melamed 2003)
– Formalized, but MT work yet to be realized
• Hierarchical MT (Chiang 2005)
– Linguistic generalizations
– Handles discontiguous phrases recursively
– Estimation problems worsen and the phrase table grows even larger
– Models across phrase boundaries
Syntax in MT and Dependency Trees
• Tree and String
– Source-side tree is provided (Se + syntax)
– Target side (Sf) is obtained by projection
• Tree and Tree
– Source-side tree is provided
– Target-side tree is provided
• Problem of isomorphism between source and target trees
– head-switching
– empty-dep; extra-dep
– Ideally, non-isomorphic trees should be modeled too
Today
• Introduction
– Dependency formalism
– Syntax in Machine Translation
• Dependency Tree based Machine
Translation
– By projection
– By synchronous modeling
• Conclusion and Future
Dependency Tree based Machine Translation
• By projection
– Fox 2002
– Dekang Lin 2004
– Quirk et al 2004, Quirk et al 2006, Menezes et al 2007
• By synchronous modeling
– Alshawi et al 2001
– Jason Eisner 2003
– Fox 2005
– Yuan Ding and Martha Palmer 2005
Phrasal Cohesion and Statistical Machine Translation
Heidi Fox, EMNLP 2002
• An English-French corpus was used
– En and Fr are structurally similar
• For phrase structure trees
– Head crossings involve the span of the head constituent of a phrase and the spans of its modifiers
– Modifier crossings involve only the spans of modifier constituents
• For dependency trees
– Head crossings are crossings between the span of a child and the span of its parent
– Modifier crossings: same as above
• Dependency structures show a cohesive nature across translation
A Path-based Transfer model
Dekang Lin 2004
• Input
– Word-aligned bitext
– Source side parsed
• Syntax translation model
– A set of paths in the source tree
– Extract the connected target path for each (see the sketch below)
• Generalization of paths to POS
• Modeling
– Relative likelihood
– Smoothing factor for noise
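To illustrate the "set of paths" idea, a rough Python sketch that enumerates the path between every word pair of a source dependency tree (up to the lowest common ancestor and down again); this illustrates the general idea under an assumed head-array encoding, not Lin's exact procedure:

```python
from itertools import combinations

def spine(heads, i):
    """Ancestor chain from word i up to the root (inclusive)."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def tree_paths(heads):
    """Enumerate the tree path between every pair of words: up from one
    word to their lowest common ancestor, then down to the other."""
    paths = []
    for i, j in combinations(range(len(heads)), 2):
        si, sj = spine(heads, i), spine(heads, j)
        lca = next(a for a in si if a in set(sj))
        up = si[:si.index(lca) + 1]        # i ... lca
        down = sj[:sj.index(lca)][::-1]    # lca's child ... j
        paths.append(up + down)
    return paths

# Each source path, paired with the connected target-side fragment its
# words align to, becomes a candidate transfer rule.
```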
A Path-based Transfer model
Dekang Lin 2004
• Decoding
– Parse the input, extract all source paths, and look up their target paths
– Find a set of transfer rules that
• cover the entire source tree
• can be consistently merged
– Lexicalized rules preferred
– Future work?
• Word ordering is addressed
– Transfer rules from the same sentence: follow the order in that sentence
– Only one example of a path: follow the order in the rule
– Many examples: pick by relative distance from the head
• Highest probability
– Dynamic Programming
• Min-set cover problem applied to trees (see the sketch below)
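The covering step can be pictured with a toy greedy approximation; the paper solves it with dynamic programming over the tree, so the rule representation and greedy strategy below are illustrative assumptions only:

```python
def greedy_cover(n_nodes, rules):
    """rules: list of (covered_node_set, logprob). Greedily pick the rule
    covering the most uncovered nodes (ties broken by score) until the
    whole source tree is covered."""
    uncovered = set(range(n_nodes))
    chosen = []
    while uncovered:
        best = max(
            (r for r in rules if r[0] & uncovered),
            key=lambda r: (len(r[0] & uncovered), r[1]),
            default=None,
        )
        if best is None:
            break  # nothing covers the rest; back off to word-level rules
        chosen.append(best)
        uncovered -= best[0]
    return chosen
```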
A Path-based Transfer model
Dekang Lin 2004
• Evaluation
– English-French: 1.2M
– Source parsed by Minipar
– 1755-sentence test set
– Sentences 5 to 15 words long
– Compared to Koehn's results from the 2003 paper
• No language model or extra generation module
– Order defined by paths is linear
– Some heuristics to maintain linearity
• Generalization of paths (transfer rules): quadratic vs. exponential
• The Direct Correspondence Assumption (DCA) is violated when translation divergences exist
• Very naïve notion of reordering and merge-conflict resolution

System   BLEU
IBM4     0.2555
PBSMT    0.3149
Current  0.2612
Dependency Treelet Translation
Quirk et al ACL 2004, 05, 06
• Project dependencies from source to target via word alignment (see the sketch below)
– One-one: project the dependency to the aligned words
– Many-one: nothing to do, as the projected word is the head
– One-many: project to the rightmost target word, and the rest are attached to it
• Reattachment of modifiers to the lowest possible node that preserves target word order
• Treelet extraction
– All subtrees on the source side up to a size limit, with the corresponding connected target fragments
– MLE for scoring
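A condensed sketch of the projection heuristics just listed, assuming the source tree is given as a head array and the alignment as (source, target) index pairs; the reattachment step and unaligned words are left out:

```python
def project_dependencies(src_heads, alignment):
    """Project source dependencies to the target side via word alignment.
    src_heads[i]: parent of source word i (-1 for root);
    alignment: list of (source_index, target_index) pairs."""
    s2t, t2s = {}, {}
    for s, t in alignment:
        s2t.setdefault(s, []).append(t)
        t2s.setdefault(t, []).append(s)

    tgt_heads = {}
    for t, srcs in t2s.items():
        # Many-to-one: use the source word that heads the aligned group.
        s = next((w for w in srcs if src_heads[w] not in srcs), srcs[0])
        tgts = sorted(s2t[s])
        if len(tgts) > 1 and t != tgts[-1]:
            # One-to-many: attach to the rightmost aligned target word.
            tgt_heads[t] = tgts[-1]
        elif src_heads[s] == -1:
            tgt_heads[t] = -1  # source root projects to target root
        elif src_heads[s] in s2t:
            # Attach to the (rightmost) target word aligned to the source head.
            tgt_heads[t] = sorted(s2t[src_heads[s]])[-1]
        # Unaligned heads/words are left for the reattachment step.
    return tgt_heads
```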
Dependency Treelet Translation
Quirk et al ACL 2004, 05, 06
[Figure: projecting the English dependency tree for "tired men and dogs" onto the French "hommes et chiens fatigués"; treelets with missing roots are also extracted]
Dependency Treelet Translation
Quirk et al 2004, 05, 06
• Translation Model
– Trained from the aligned, projected corpus
– Log-linear with feature functions (see the sketch below)
• Channel Model
– Treelet probability
– Lexical probability
• Order Model
– Head-relative
– Swap model
• Target Model
– Target language model
– Bigram agreement model (optional)
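The log-linear combination amounts to a weighted sum of log feature scores; a minimal sketch, with hypothetical feature names standing in for the channel, order, and target models above:

```python
import math

def loglinear_score(candidate, features, weights):
    """Score a candidate translation as sum_i w_i * log f_i(candidate)."""
    return sum(
        weights[name] * math.log(f(candidate))
        for name, f in features.items()
    )

# Hypothetical feature set mirroring the slide; field names are invented.
features = {
    "treelet":   lambda c: c["p_treelet"],
    "lexical":   lambda c: c["p_lex"],
    "order":     lambda c: c["p_order"],
    "target_lm": lambda c: c["p_lm"],
}
weights = {"treelet": 1.0, "lexical": 0.5, "order": 0.8, "target_lm": 1.2}
```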
Dependency Treelet Translation
Quirk et al ACL 2004, 05, 06
• Decoding (step by step)
– Input is a dependency-analyzed source sentence
• Challenge: left-to-right decoding may not work when starting with a tree
– Obtain the best target tree by combining the models
– Exhaustive search using DP
• Translate bottom-up, from a given subtree (as in ITG)
• For each head node, extract all matching treelets x_i
– For each uncovered subtree, extract all matching treelets y_i
» Try all insertions of y_i into slots in x_i
» The ordering model ranks all the reordering possibilities for the modifiers
Dependency Treelet Translation
Quirk et al ACL 2004, 05, 06
• Decoding Optimizations
– Duplicate translations: check & reuse
– N-best list (only maintain the top candidates)
– Early pruning before reordering (channel model)
– Greedy reordering (pick the best one and move on)
– Variable n-best size (dynamically reduce 'n' as the number of uncovered subtrees grows)
– Deterministic pruning of treelets based on MLE (allowing the decoder to try more reorderings)
• A* decoding
– Estimate the cost of an uncovered node's reordering instead of computing it exactly
– Heuristics give optimistic estimates for each of the models
Dependency Treelet Translation
Quirk et al ACL 2004, 05, 06
• Evaluation
– English-French
– 1.5M parallel Microsoft technical documentation
– English side parsed with NLPWIN
– Alignments trained with GIZA++
– Target LM: French side of the parallel data
– Tuned on 250 sentences for MaxBLEU
– Tested on 10K unseen sentences
– 1 reference
Improvements to Treelet Translation
• Dependency Order Templates (ACL 2007)
– Improve generality in translation
– Learn unlexicalised order templates
– Used only at runtime, to restrict the search space during reordering
• Minimal Translation Units (HLT-NAACL 2005)
– Bilingual n-gram channel model (Banchs et al. 2005)
• M = <m1, m2, …>
• m1 = <si, tj>
– Instead of conditioning on the surface-adjacent MTU, they condition on the head-word chain (see the sketch below)
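To make the MTU notation above concrete, a toy Python sketch that groups an aligned sentence pair into minimal translation units, treating MTUs as the connected components of the word-alignment graph; the representation and helper names are assumptions for illustration:

```python
def extract_mtus(alignment, src_len, tgt_len):
    """Group aligned words into minimal translation units: connected
    components of the bipartite word-alignment graph."""
    parent = list(range(src_len + tgt_len))  # union-find over src + tgt nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, t in alignment:
        rs, rt = find(s), find(src_len + t)
        if rs != rt:
            parent[rt] = rs

    units = {}
    for s, t in alignment:
        unit = units.setdefault(find(s), {"src": set(), "tgt": set()})
        unit["src"].add(s)
        unit["tgt"].add(t)
    return list(units.values())

# extract_mtus([(0, 0), (1, 1), (1, 2)], src_len=2, tgt_len=3)
# -> MTUs <s0; t0> and <s1; t1 t2>
```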
Dependency Tree based Machine Translation
• By projection
– Fox 2002
– Dekang Lin 2004
– Quirk et al 2004, Quirk et al 2006, Menezes et al 2007
• By synchronous modeling
– Alshawi et al 2001
– Jason Eisner 2003
– Yuan Ding and Martha Palmer 2005
– Fox 2005
Learning Dependency Translation Models as Collections of Finite-State Head Transducers
Alshawi et al 2001
• Head transducers variant
– Middle-out string transduction vs. left-to-right
– Can be used in a hierarchical fashion, if you consider the input/output for non-head transitions as 'strings' rather than 'words'
• Dependency transduction model
[Figure: head transducer with empty input/output transitions; may not always be a dependency model in the conventional sense]
Learning Dependency Translation Models as Collections of Finite-State Head Transducers
Alshawi et al 2001
• Training: given unaligned bitext
– Compute co-occurrence statistics at the word level
– Find a hierarchical synchronous alignment driven by a cost function
– Construct a set of head transducers that explain the alignment
– Calculate the transition weights by MLE
• Decoding
– Similar to CKY or chart parsing, but 'middle-out'
– Given the input, find the best applications of transducers
– A derivation spanning the entire input means it has probably found the best dependencies for source & target
– Else, string together the most probable partial hypotheses to form a tree
– Pick the target tree with the lowest cost and read off the string
Learning Dependency Translation Models as Collections of Finite-State Head Transducers
Alshawi et al 2001
• Evaluation
– Eng-Spanish (ATIS data: 13,966 train, 1,185 test)
– Eng-Japanese (transcribed speech data: 12,226 train, 3,253 test)
• Discussion
– Language agnostic, direction agnostic
– The induced dependency tree may not be syntactically motivated, but is suited to translation
– Transducers are applied locally, so less context information is available
– A single transducer tries to do everything, so training may have sparsity problems
Learning non-isomorphic tree mappings for MT
Jason Eisner 2003
• Non-isomorphism arises not just from language divergences but also from free translation
• A version of Tree Substitution Grammar (TSG)
– Learns from unaligned, non-isomorphic trees
– Generalization is driven by a statistical model rather than by linguistic minimalism
– Expressive, with empty-string insertions
– Formulated for both PSG and DG
• Translation model
– Joint model P(Ts, Tt, A) over source tree, target tree, and alignment
• Alignment
• Decoding
• Training
– Factorization helps:
• Reconstruct all derivations of a tree with an efficient 'tree parsing' algorithm for TSG
• EM as efficient inside-outside training over all derivations
• Decoding
– Chart parsing creates a forest of derivations for the input tree
– Maximize over the probability of derivations
– The 1-best derivation parse is a syntactic alignment
Example sentence pair: 1. Kids kiss Sam quite often / 2. Lots of kids give kisses to Sam
Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars
Ding and Palmer 2005
• SDIG
– Like STAG and STIG for phrase structures
– Basic units are elementary trees
– Handles non-isomorphism at the sub-tree level
• Cross-lingual inconsistencies are handled if they appear within basic units
– Crossing dependencies
– Broken dependencies
Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars
Ding and Palmer 2005
• Induction of SDIG for MT as synchronous hierarchical tree partitioning
– Train IBM Model 1 scores on the bitext
– For each category of node, starting with NP
– Perform synchronous tree partitioning operations
• Compute the probability of each word pair (ei, fi) where the operation can be performed
• Heuristic functions (graphical model) guide the partitioning
Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars
Ding and Palmer 2005
• Translation
• Decoding for MT
– The translation is obtained by
• maximizing over all possible derivations of the source tree
• translating the 'elementary trees'
– Analogous to an HMM (emission and transition probabilities over elementary trees)
– Decoding is similar to a Viterbi-style algorithm on the tree (see the sketch below)
• Hooks
– Augmenting the corpus with singleton ETs from Model 1
– Smoothing probabilities
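A skeletal sketch of the Viterbi-style tree decoding referenced above, assuming each node offers candidate elementary-tree translations with emission scores and a transition score links a parent's choice to a child's; names and representation are invented for illustration:

```python
def tree_viterbi(children, root, emissions, transition):
    """Bottom-up max-sum over a tree. emissions[n]: list of
    (candidate, score); transition(parent_cand, child_cand): score.
    Returns the best total score for each candidate at the root."""
    def best(node):
        kid_tables = [best(c) for c in children.get(node, [])]
        scores = {}
        for cand, emit in emissions[node]:
            total = emit
            for table in kid_tables:
                # Best child choice given this parent candidate.
                total += max(s + transition(cand, kc)
                             for kc, s in table.items())
            scores[cand] = total
        return scores
    return best(root)

# Toy usage: two-node tree, log-prob scores, flat transition score.
children = {"gave": ["apple"]}
emissions = {
    "gave":  [("donna", -1.0), ("diede", -0.5)],
    "apple": [("mela", -0.2)],
}
print(tree_viterbi(children, "gave", emissions, lambda p, c: -0.1))
# {'donna': -1.3, 'diede': -0.8}
```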
Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars
Ding and Palmer 2005
• Evaluation
– Chinese-English system
– Dan Bikel's parser for both Chinese and English, trained on parallel treebanks
– Tested with 4 references
• Compared with
– GIZA-trained baseline
– ISI ReWrite decoder
• NIST increased 97%
• BLEU increased 27%
• Reordering ignored for now
Dependency Based Statistical MT
Fox 2005
• Czech-English parallel corpus (Penn Treebank and Prague Dependency Treebank)
– Morphological processing and tecto-grammatical conversion for the Czech trees
– No processing for the English trees
• Alignment of subtrees via IBM Model 4 scores
– followed by structural modification of the trees to suit the alignment (KEEP, SPLIT, BUD, …)
• Translation model: [equation omitted in the slides]
Dependency Based Statistical MT
Fox 2005
• Decoding
– Best-first decoder
– Parse the given Czech input into a dependency tree and translate each node independently
– For each node:
• Choose the head position
• Generate the English POS sequence
• Generate the feature list
• Perform structural mutations
• Syntax Language Model
– Takes as input a forest of phrase structures
– Inverts the decoder's forest output (dependency tree nodes) into phrase structures
– Reordering is left entirely to the LM
• Evaluation
– Work in progress
– Proposed to use BLEU
Today
• Introduction
– Dependency formalism
– Syntax in Machine Translation
• Dependency Tree based Machine
Translation
– By projection
– By synchronous modeling
• Conclusion and Future
Conclusion
• The good:
– Easy to work with
– Cohesive during projection
– Builds well on top of existing PBSMT (effective combination of lexicalization and syntax)
– Supports modeling the target even across phrase boundaries
– Degrades gracefully on new domains
• The bad:
– Reordering is not crucial, but expensive
– Lots of hooks needed for decoding
– Generalization explodes the search space
• The not so good:
– Current approaches require a dependency tree on the source side and a strong model for the target side
What Next…
• 1 year
– Better scoring and estimation in syntactic translation models
– Does improvement in dependency parse quality translate directly into better MT? (Quirk et al 2006) What about the MST parser, etc.?
– Better word alignment and its effect on the model
– Incorporating labeled dependencies: will it help?
– Factored dependency-tree-based models
– Approximate subtree matching and parsing algorithms
• 3-5 years
– Decoding algorithms and the target-ordering problem
– Discriminative approaches to MT are catching up; how can syntax be incorporated into such a framework?
– Better syntactic language models based on dependency formalisms
– Semantics in translation (are dependency trees the first step?)
– Fusion of dependency and constituent approaches (LFG style)
– Joint modeling approaches (Eisner 03, Smith 06 QS Grammar)
– Taking MT to other applications like cross-lingual retrieval and QA, which already use dependency formalisms
Thanks to
• Lori Levin: For discussion on Dependency
tree formalism
• Amr Ahmed: For discussion and
separation of work
• Respective authors of the papers for some
of the graphic images I liberally used in the
slides
Questions
Thanks
DG Variants
• Case Grammar (Anderson)
• Daughter-Dependency Theory (Hudson)
• Dependency Unification Grammar (Hellwig)
• Functional-Generative Description (Sgall)
• Lexicase (Starosta)
• Meaning-Text Model (Mel'cuk)
• Metataxis (Schubert)
• Unification Dependency Grammar (Maxwell)
• Constraint Dependency Grammar (Maruyama)
Motivation Questions
• 1. How is dependency analysis used in syntax-based MT? How do the algorithms vary if only the source-side analysis is present?
• 2. How do the decoding and transfer phases adapt when using dependency analysis? What algorithms exist, and what is their complexity?
• 3. How does dependency-based syntax incorporation in MT compare with other grammar formalisms like phrase structure grammar?
• 4. Is there a class of languages that yields better to dependency analysis vs. other analyses?
• 5. Dependency analysis being close to semantics, does it help MT produce better results?
Other Papers
• Quasi-Synchronous Grammars for Soft Syntactic Projection
David Smith and Jason Eisner 2007
• Automatic Learning of Parallel Dependency Treelet Pairs
Yuan Ding and Martha Palmer 2004
• Dependency vs. Constituents for Tree-Based Alignment
Dan Gildea 2003
• My Compilation:
– http://kathmandu.lti.cs.cmu.edu:8080/wiki/index.php/AMT:Schedule