Synchronous Grammars and Tree Automata
David Chiang and Kevin Knight
USC/Information Sciences Institute, Viterbi School of Engineering, University of Southern California

Why Worry About Formal Language Theory?
• They already figured out most of the key things back in the 1960s & 1970s
• Lucky us!
  – Helps clarify our thinking about applications
  – Helps modeling & algorithm development
  – Helps promote code re-use
  – Opportunity to develop novel, efficient algorithms and data structures that bring ancient theorems to life
• Formal grammar and automata theory are the daughters of natural language processing – let's keep in touch

[Chomsky 57]
• Distinguish grammatical English from ungrammatical English:
  – John thinks Sara hit the boy
  – * The hit thinks Sara John boy
  – John thinks the boy was hit by Sara
  – Who does John think Sara hit?
  – John thinks Sara hit the boy and the girl
  – * Who does John think Sara hit the boy and?
  – John thinks Sara hit the boy with the bat
  – What does John think Sara hit the boy with?
  – Colorless green ideas sleep furiously.
  – * Green sleep furiously ideas colorless.

This Research Program has Contributed Powerful Ideas
• Context-free grammar
• Formal language hierarchy
• Compiler technology

This Research Program has NLP Applications
• Score alternative speech recognition or translation outputs, and pick the best one
  [figure: candidate outputs scored on two axes – green = how grammatical, blue = how sensible – with the best candidate marked "Pick this one!"]

This Research Program Has Had Wide Reach
• What makes a legal RNA sequence?
• What is the structure of a given RNA sequence?
• Sakakibara, Brown, Hughey, Mian, Sjolander, Underwood & Haussler. Stochastic Context-Free Grammars for tRNA Modeling. Nucleic Acids Research, 22(23):5112–5120, 1994.

This Research Program is Really Unfinished!
• We still can't type in an arbitrary English sentence and get reliable answers to: Is this grammatical? Is this sensible?
Acceptors and Transformers
• Chomsky's program is about which utterances are acceptable
• Other research programs are aimed at transforming utterances
  – Translate an English sentence into Japanese…
  – Transform a speech waveform into transcribed words…
  – Compress a sentence, summarize a text…
  – Transform a syntactic analysis into a semantic analysis…
  – Generate a text from a semantic representation…

Strings and Trees
• Early on, trees were realized to be a useful tool in describing what is grammatical
  – A sentence is a noun phrase (NP) followed by a verb phrase (VP)
  – A noun phrase is a determiner (DT) followed by a noun (NN)
  – A noun phrase is a noun phrase (NP) followed by a prepositional phrase (PP)
  – A PP is a preposition (IN) followed by an NP
  [tree diagram: S over NP and VP, with the NP expanding to NP and PP, and the PP to IN and NP]
• A string is acceptable if it has an acceptable tree …
• Transformations may take place at the tree level …

Natural Language Processing
• 1980s: Many tree-based grammatical formalisms
• 1990s: Regression to string-based formalisms
  – Hidden Markov Models (HMMs), Finite-State Acceptors (FSAs) and Transducers (FSTs)
  – N-gram models for accepting sentences [e.g., Jelinek 90]
  – Taggers and other statistical transformations [e.g., Church 88]
  – Machine translation [e.g., Brown et al 93]
  – Software toolkits implementing generic weighted FST operations [e.g., Mohri, Pereira, Riley 00]
  [figure: a weighted FSA w and WFSTs e, j, k in a cascade – "backwards application of string k through a composition of a transducer cascade, intersected with a weighted FSA language model"]

Natural Language Processing
• 2000s: Emerging interest in tree-based probabilistic models
  – Machine translation [Wu 97, Yamada & Knight 02, Melamed 03, Chiang 05, …]
  – Summarization [Knight & Marcu 00, …]
  – Paraphrasing [Pang et al 03, …]
  – Question answering [Echihabi & Marcu 03, …]
  – Natural language generation [Bangalore & Rambow 00, …]

Goal of this tutorial
• What are the conceptual tools to help us get a grip on encoding and exploiting knowledge about linguistic tree transformations?

Outline
• Part 1: Introduction
• Part 2: Synchronous Grammars
• < break >
• Part 3: Tree Automata
• Part 4: Conclusion

Tree Automata
• David talked about synchronous grammars (two synchronized output trees)
• I'll talk about tree automata (an input tree transformed into an output tree)

Steps to Get There
                 Strings                             Trees
  Grammars       Regular grammar                     ?
  Acceptors      Finite-state acceptor (FSA), PDA    ?
                 String Pairs (input→output)         Tree Pairs (input→output)
  Transducers    Finite-state transducer (FST)       ?
Context-Free Grammar
• Example:
  – S → NP VP    [p=1.0]
  – NP → DET N   [p=0.7]
  – NP → NP PP   [p=0.3]
  – PP → P NP    [p=1.0]
  – VP → V NP    [p=0.4]
  – DET → the    [p=1.0]
  – N → boy      [p=1.0]
  – V → saw      [p=1.0]
  – P → with     [p=1.0]
• Defines a set of strings
• The language described by a CFG can also be described by a push-down acceptor
• Generative process (built up over several slides): start at S; rewrite S → NP VP; rewrite the subject NP → NP PP; rewrite the inner NP → DET N → "the boy"; rewrite PP → P NP; and so on, until every frontier symbol is a word
  [tree diagram of the resulting derivation: S(NP(NP(DET the, N boy), PP(P with, NP(DET the, N boy))), VP(V saw, NP(DET the, N boy)))]
• (A small code sketch of this generative process follows below.)

Context-Free Grammar
• Is this what lies behind modern parsers like Collins and Charniak?
• No… they do not have a finite list of productions, but an essentially infinite list.
  – "Markovized" grammar
• Generative process is head-out (built up over several slides):
  – Generate the head child of S: VP, with probability Ph(VP | S)
  – Generate left siblings one at a time, moving outward: NP with Pleft(NP | VP, S), then PP with Pleft(PP | VP, S), then STOP with Pleft(STOP | VP, S)
  – Generate right siblings the same way: ADVP, then STOP
  – Recurse into each child, e.g. VP generates its head VBD, then right siblings NP, NP, …
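To make the plain CFG generative process above concrete, here is a minimal sampling sketch. The dictionary encoding and function names are illustrative assumptions, not the notation of any parser or toolkit; the rules and probabilities are copied from the example slide.

  import random

  # Hypothetical encoding of the slide's example PCFG: each nonterminal maps to a
  # list of (right-hand side, probability) pairs.
  PCFG = {
      "S":   [ (["NP", "VP"], 1.0) ],
      "NP":  [ (["DET", "N"], 0.7), (["NP", "PP"], 0.3) ],
      "PP":  [ (["P", "NP"], 1.0) ],
      "VP":  [ (["V", "NP"], 0.4) ],    # probability copied from the slide
      "DET": [ (["the"], 1.0) ],
      "N":   [ (["boy"], 1.0) ],
      "V":   [ (["saw"], 1.0) ],
      "P":   [ (["with"], 1.0) ],
  }

  def sample(symbol):
      """Top-down generative process: expand a symbol into a (sub)tree."""
      if symbol not in PCFG:                       # terminal symbol
          return symbol
      rules = PCFG[symbol]
      rhs = random.choices([r for r, p in rules],
                           weights=[p for r, p in rules])[0]
      return (symbol, [sample(child) for child in rhs])

  def yield_of(tree):
      """Read the terminal string off a sampled tree."""
      if isinstance(tree, str):
          return [tree]
      _, children = tree
      return [w for c in children for w in yield_of(c)]

  tree = sample("S")
  print(" ".join(yield_of(tree)))                  # e.g. "the boy saw the boy"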
ECFG [Thatcher 67]
• Example (extended CFG: regular expressions on the right-hand sides):
  – S → ADVP* NP VP PP*
  – VP → VBD NP [NP] PP*
• Defines a set of strings
• Can model Charniak and Collins:
  – VP → (ADVP | PP)* VBD (NP | PP)*
• Can even model the probabilistic version, by weighting the right-hand-side finite-state automaton
  [figure: an FSA over the children of VP, with arcs such as ADVP : Pleft(ADVP|VP), PP : Pleft(PP|VP), ε : Pleft(STOP|VP), ε : Phead(VBD|VP), VBD : 1.0, NP and PP arcs on the right, and ε : Pright(STOP|VP)]
• (A small code sketch of this head-out, stop-terminated generation appears a little further below.)

ECFG [Thatcher 67]
• Is ECFG more powerful than CFG?
• It is possible to build a CFG that generates the same set of strings as your ECFG
  – As long as the right-hand side of every rule is regular
  – Or even context-free! (see footnote, [Thatcher 67])
• BUT: the CFG won't generate the same derivation trees as the ECFG

ECFG [Thatcher 67]
• For example:
  – An ECFG can (of course) accept the string language a*
  – Therefore, we can build a CFG that also accepts a*
  – But the ECFG can do it via this set of derivation trees, if so desired: one flat tree per string – S over a, S over a a, S over a a a, S over a a a a, … – which no CFG can produce

Tree Generating Systems
• Sometimes the trees are important
  – Parsing, RNA structure prediction
  – Tree transformation
• We can view CFG or ECFG as a tree-generating system, if we squint hard …
• Or, we can look at tree-generating systems directly
  – Regular tree grammar (RTG) is standard in the automata literature
    • Slightly more powerful than tree substitution grammar (TSG)
    • Less powerful than tree-adjoining grammar (TAG)
  – Top-down tree acceptors (TDTA) are also standard

What We Want
• A device or grammar D for compactly representing a possibly-infinite set S of trees (possibly with weights)
• Want to support operations like:
  – Membership testing: Is tree X in the set S?
  – Equality testing: Do the sets described by D1 and D2 contain exactly the same trees?
  – Intersection: Compute the (possibly-infinite) set of trees described by both D1 and D2
  – Weighted tree-set intersection, e.g.:
    • D1 describes a set of candidate English translations of some Chinese sentence, including disfluent ones
    • D2 describes a general model of fluent English
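As promised above, here is a small sketch of the head-out ("markovized") generation of one node's children, the process the probabilistic ECFG right-hand sides encode. The probability tables below are made-up placeholders, not Collins' or Charniak's actual distributions, which condition on much richer context.

  import random

  # Hypothetical distributions standing in for Ph, Pleft, Pright on the slides.
  P_head  = {"S": {"VP": 1.0}}
  P_left  = {("VP", "S"): {"NP": 0.5, "PP": 0.2, "STOP": 0.3}}
  P_right = {("VP", "S"): {"ADVP": 0.3, "STOP": 0.7}}

  def draw(dist):
      return random.choices(list(dist), weights=list(dist.values()))[0]

  def expand(parent):
      """Head-out generation of the children of one node, stopping at STOP."""
      head = draw(P_head[parent])
      left, right = [], []
      while True:                         # left siblings, generated outward from the head
          sib = draw(P_left[(head, parent)])
          if sib == "STOP":
              break
          left.insert(0, sib)
      while True:                         # then right siblings
          sib = draw(P_right[(head, parent)])
          if sib == "STOP":
              break
          right.append(sib)
      return left + [head] + right        # e.g. ['PP', 'NP', 'VP', 'ADVP']

  print(expand("S"))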
Regular Tree Grammar (RTG)
• Example:
  – q → S(qnp, VP(V(run)))   [p=1.0]
  – qnp → NP(qdet, qn)       [p=0.7]
  – qnp → NP(qnp, qpp)       [p=0.3]
  – qpp → PP(qprep, qnp)     [p=1.0]
  – qdet → DET(the)          [p=1.0]
  – qprep → PREP(of)         [p=1.0]
  – qn → N(sons)             [p=0.5]
  – qn → N(daughters)        [p=0.5]
• Defines a set of trees
• Generative process (built up over several slides): start with the state q; rewrite it to S(qnp, VP(V(run))); rewrite the qnp to NP(qnp, qpp), the inner qnp to NP(qdet, qn), qdet to DET(the), qn to N(sons); rewrite qpp to PP(PREP(of), NP(DET(the), N(daughters)))
  [tree diagram of the result: S(NP(NP(DET the, N sons), PP(PREP of, NP(DET the, N daughters))), VP(V run))]
  P(t) = 1.0 × 0.3 × 0.7 × 0.5 × 0.7 × 0.5
• (A small code sketch of this grammar and its tree probabilities follows below.)

Relation Between RTG and CFG
• For every CFG, there is an RTG that directly generates its derivation trees
  – TRUE (just convert the notation)
• For every RTG, there is a CFG that generates the same trees in its derivations
  – FALSE
  – Example: accept exactly the two trees NP(NN(killer), NN(clown)) and NP(NN(clown), NN(killer)), but reject NP(NN(clown), NN(clown))
    RTG: q → NP(NN(clown), NN(killer));  q → NP(NN(killer), NN(clown))
    No CFG possible.
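Here is the weighted RTG from the slide above written out as data, with a small scoring function. The tuple encoding of trees and states is an illustrative assumption, not the input format of any tree-automata toolkit.

  # Trees are tuples (label, children); grammar states (q, qnp, ...) are plain strings.
  def t(label, *children):
      return (label, list(children))

  RTG = {
      "q":     [ (t("S", "qnp", t("VP", t("V", t("run")))), 1.0) ],
      "qnp":   [ (t("NP", "qdet", "qn"), 0.7), (t("NP", "qnp", "qpp"), 0.3) ],
      "qpp":   [ (t("PP", "qprep", "qnp"), 1.0) ],
      "qdet":  [ (t("DET", t("the")), 1.0) ],
      "qprep": [ (t("PREP", t("of")), 1.0) ],
      "qn":    [ (t("N", t("sons")), 0.5), (t("N", t("daughters")), 0.5) ],
  }

  def prob(state, tree):
      """Total probability that 'state' derives 'tree' (sums over derivations)."""
      total = 0.0
      for rhs, p in RTG.get(state, []):
          m = match(rhs, tree)
          if m is not None:
              total += p * m
      return total

  def match(rhs, tree):
      """Probability contributed by matching one right-hand side against a tree."""
      if isinstance(rhs, str):             # a state: recurse
          return prob(rhs, tree) or None
      label, children = rhs
      if tree[0] != label or len(tree[1]) != len(children):
          return None
      p = 1.0
      for r, sub in zip(children, tree[1]):
          m = match(r, sub)
          if m is None:
              return None
          p *= m
      return p

  sons      = t("NP", t("DET", t("the")), t("N", t("sons")))
  daughters = t("NP", t("DET", t("the")), t("N", t("daughters")))
  tree = t("S", t("NP", sons, t("PP", t("PREP", t("of")), daughters)),
                t("VP", t("V", t("run"))))
  print(prob("q", tree))   # 1.0 * 0.3 * 0.7 * 0.5 * 0.7 * 0.5 = 0.03675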
Relation Between RTG and TSG
• For every TSG, there is an RTG that directly generates its trees
  – TRUE (just convert the notation)
• For every RTG, there is a TSG that generates the same trees
  – FALSE
• Using states, an RTG can accept all trees (over symbols a and b) that contain exactly one a

Relation of RTG to Johnson [98] Syntax Model
  [figure: the tree TOP(S(NP(PRO), VP(VB, NP(PRO)))), with rule probabilities conditioned on the parent label, e.g. P(PRO | NP:S) = 0.21 versus P(PRO | NP:VP) = 0.03; a second copy of the tree shows every node annotated with its parent – S:TOP, NP:S, VP:S, PRO:NP, VB:VP, NP:VP]
• The same model as an RTG:
  – qstart → S(q.np:s, q.vp:s)
  – q.np:s → NP(q.pro:np)     [p=0.21]
  – q.np:vp → NP(q.pro:np)    [p=0.03]
  – q.pro:np → PRO(q.pro)
  – q.pro → he / she / him / her

Relation of RTG to Lexicalized Syntax Models
• Collins' model assigns P(t) to every tree
• Can a weighted RTG (wRTG) implement P(t) for language modeling?
  – Just like a WFSA can implement a smoothed n-gram language model…
• Something like yes.
  – States can encode relevant context that is passed up and down the tree.
  – Technical problems:
    • "Markovized" grammar (Extended RTG?)
    • Some models require keeping head-word information, and if we back off to a state that forgets the head-word, we can't get it back.

Something an RTG Can't Do
• The set of perfectly balanced binary trees (internal nodes labeled a, leaves labeled b, all leaves at the same depth) is not a regular tree language.
• Note also that the yield language is {b^(2^n) : n ≥ 1}, which is not context-free.

Language Classes in String World and Tree World
  [figure: in the string world, the regular languages sit inside the CFLs, which sit inside the indexed languages; in the tree world, the regular tree languages (RTL) sit inside the context-free tree languages (CFTL); taking yields maps RTL to CFL and CFTL to the indexed languages]

Picture So Far
                 Strings                             Trees
  Grammars       Regular grammar, CFG                DONE (RTG)
  Acceptors      Finite-state acceptor (FSA), PDA    ?
                 String Pairs (input→output)         Tree Pairs (input→output)
  Transducers    Finite-state transducer (FST)       ?

FSA Analog: Top-Down Tree Acceptor [Rabin 69; Doner 70]
• RTG:  q → NP(NN(clown), NN(killer));  q → NP(NN(killer), NN(clown))
• Equivalent TDTA:
  – q, NP → (q2, q3)
  – q, NP → (q3, q2)
  – q2, NN → (q2)
  – q3, NN → (q3)
  – q2, clown → accept
  – q3, killer → accept
  [figure: the acceptor walks down NP(NN(clown), NN(killer)), sending q2 to the first child and q3 to the second, and accepts at the leaves]
• For any RTG, there is a TDTA that accepts the same trees.
• For any TDTA, there is an RTG that accepts the same trees.
• (A small acceptance sketch in code appears a little further below.)

FSA Analog: Bottom-Up Tree Acceptor [Thatcher & Wright 68; Doner 70]
• A similar story for bottom-up acceptors…
• To summarize tree acceptor automata and RTGs:
  – Tree acceptors are the analogs of string FSAs
  – They are often used in proofs
  – Their visual appeal is not as great as that of FSAs
  – People often prefer to read and write RTGs

Properties of Language Classes
                               String Sets                      Tree Sets
                               RSL (FSA, rexp)  CFL (PDA, CFG)  RTL (TDTA, RTG)
  Closed under union           YES              YES             YES
  Closed under intersection    YES              NO              YES
  Closed under complement      YES              NO              YES
  Membership testing           O(n)             O(n^3)          O(n)
  Emptiness decidable?         YES              YES             YES
  Equality decidable?          YES              NO              YES
  (references: Hopcroft & Ullman 79, Gécseg & Steinby 84)

Picture So Far
                 Strings                             Trees
  Grammars       Regular grammar, CFG                DONE (RTG)
  Acceptors      Finite-state acceptor (FSA), PDA    DONE (TDTA)
                 String Pairs (input→output)         Tree Pairs (input→output)
  Transducers    Finite-state transducer (FST)       ?

Transducers: String Transducer vs. Tree Transducer
• An FST compactly represents a possibly-infinite set of string pairs; a tree transducer compactly represents a possibly-infinite set of tree pairs
• A probabilistic FST assigns P(s2 | s1) to every string pair; a probabilistic tree transducer assigns P(t2 | t1) to every tree pair
• Can ask: What's the best transformation of input string s1? Likewise: What's the best transformation of tree t1?
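The promised acceptance sketch: running the two-rule top-down tree acceptor from the clown/killer example. The transition encoding is an illustrative assumption, not a standard notation.

  # A transition maps (state, node label) to the tuples of states sent to the children;
  # a leaf is accepted when (state, leaf label) is listed as accepting.
  TRANSITIONS = {
      ("q",  "NP"): [("q2", "q3"), ("q3", "q2")],
      ("q2", "NN"): [("q2",)],
      ("q3", "NN"): [("q3",)],
  }
  ACCEPT = {("q2", "clown"), ("q3", "killer")}

  def accepts(state, tree):
      label, children = tree
      if not children:                                   # leaf
          return (state, label) in ACCEPT
      options = TRANSITIONS.get((state, label), [])
      return any(len(states) == len(children) and
                 all(accepts(s, c) for s, c in zip(states, children))
                 for states in options)

  NN = lambda w: ("NN", [(w, [])])
  print(accepts("q", ("NP", [NN("clown"), NN("killer")])))   # True
  print(accepts("q", ("NP", [NN("clown"), NN("clown")])))    # False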
Example 1: Machine Translation [Yamada & Knight 01]
  [figure: an English parse tree for "he adores listening to music" is transformed step by step – Reorder the children, Insert Japanese function words (ha, ga, no, desu), Translate the leaves, Take Leaves – yielding the Japanese sentence "Kare ha ongaku wo kiku no ga daisuki desu"]

Example 2: Sentence Compression [Knight & Marcu 00]
  [figure: a full parse tree for "he adores listening to good music at home" is mapped to a smaller parse tree for "he adores listening to music"]

What We Want
• A device T for compactly representing a possibly-infinite set of tree pairs (possibly with weights)
• Want to support operations like:
  – Forward application: Send input tree X through T, get all possible output trees, perhaps represented as an RTG.
  – Backward application: What input trees, when sent through T, would output tree X?
  – Composition: Given T1 and T2, can we build a transducer T3 that does the work of both in sequence?
  – Equality testing: Do T1 and T2 contain exactly the same tree pairs?
  – Training: How to set weights for T given a training corpus of input/output tree pairs?

Finite-State (String) Transducer
• Example FST: q --k:ε--> q2,  q2 --n:N--> q,  q --i:AY--> q,  q --g:ε--> q3,  q3 --h:ε--> q4,  q4 --t:T--> qfinal
• Applied to the input "k n i g h t", it consumes one input symbol per step (shown over several slides) and emits the output "N AY T"
• (A small application sketch in code follows below.)

Trees
  [figure: original input – an English parse tree for "he enjoys listening to music"; target output – a Japanese parse tree whose yield is "kare wa ongaku o kiku no ga daisuki desu"]
• The transformation proceeds top-down (shown over several slides): a state q starts at the root of the input tree, splits into states such as q.np and q.vbz over the children, reorders and relabels material, and builds the output tree piece by piece until a final state qfinal is reached with the complete Japanese tree
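The string-transducer example above in code. The dictionary encoding is an illustrative assumption, not a toolkit format; this particular transducer happens to be deterministic, so no search over alternatives is needed.

  # The slide's FST maps the spelling k-n-i-g-h-t to the phonemes N AY T.
  # Each transition is (state, input symbol) -> (output, next state); "" is epsilon output.
  FST = {
      ("q",  "k"): ("",   "q2"),
      ("q2", "n"): ("N",  "q"),
      ("q",  "i"): ("AY", "q"),
      ("q",  "g"): ("",   "q3"),
      ("q3", "h"): ("",   "q4"),
      ("q4", "t"): ("T",  "qfinal"),
  }
  FINAL = {"qfinal"}

  def transduce(symbols, start="q"):
      state, output = start, []
      for sym in symbols:
          if (state, sym) not in FST:
              return None                  # no transition: input rejected
          out, state = FST[(state, sym)]
          if out:
              output.append(out)
      return output if state in FINAL else None

  print(transduce(list("knight")))         # ['N', 'AY', 'T']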
Top-Down Tree Transducer
• Introduced by Rounds (1970) & Thatcher (1970)
  "Recent developments in the theory of automata have pointed to an extension of the domain of definition of automata from strings to trees … parts of mathematical linguistics can be formalized easily in a tree-automaton setting … Our results should clarify the nature of syntax-directed translations and transformational grammars …" (Mappings on Grammars and Trees, Math. Systems Theory 4(3), Rounds 1970)
• "FST for trees"
• Large literature – e.g., Gécseg & Steinby (1984), Comon et al (1997)
• Many classes of transducers – top-down (R), bottom-up (F), copying, deleting, lookahead…

Top-Down Tree Transducer
• R ("root to frontier") transducer
• Original input: S(PRO(he), VP(VB(likes), NP(spinach)))
• Target output: X(VP(VB(likes), NP(spinach)), PRO(he), AUX(does))
• Rule, in computer-speak: q S(x0, x1) → X(q x1, q x0, AUX(does))
• The state q starts at the root; the rule swaps the two subtrees and inserts AUX(does), and processing continues at the sites marked q x0 and q x1 (shown over several slides)

Relation of Tree Transducers to (String) FSTs
• Tree transducers generalize FSTs (strings are "monadic trees")
  – FST transition q --A:B--> r   ≡   tree transducer rule  q A(x0) → B(r x0)
  – FST transition q --A:ε--> r   ≡   q A(x0) → r x0
  – FST transition q --ε:B--> r   ≡   q x → B(r x)

Tree Transducer Simulating FST
• Write the input string k-n-i-g-h-t vertically, as the monadic tree k(n(i(g(h(t)))))
• Rules such as q k(x0) → q2 x0 and q2 n(x0) → N(q x0) reproduce the FST run from the earlier slide, one level at a time (and so on …)

Top-Down Transducers can Copy and Delete
• Sentence compression – an R transducer that deletes:
  – q VP(x0, x1) → VP(q x0)      (the PP child x1 is dropped)
• Calculus – an R transducer that copies (example from Rounds 70):
  – q d(sin(x0)) → MULT(cos(i x0), d(i x0))      (x0 is copied into both branches, computing d sin(y) = cos(y) · d y)

Complex Re-Ordering
• Original input: S(PRO, VP(VB, NP));  desired output: S(VB, PRO, NP)
• An R transducer can do it, but only by copying the VP subtree and then deleting inside each copy:
  – q S(x0, x1) → S(qleft x1, q x0, qright x1)
  – qleft VP(x0, x1) → q x0
  – qright VP(x0, x1) → q x1
  – q PRO → PRO;  q VB → VB;  q NP → NP
• The transformation is shown step by step over several slides, ending with the final output S(VB, PRO, NP)
• (A small rule-application sketch in code follows below.)
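Here is the complex re-ordering R transducer above as a small runnable sketch. The rule encoding (left-hand sides as (state, root label), right-hand sides as templates with (state, variable-index) leaves) is a hand-rolled illustration, not a standard transducer format.

  RULES = {
      ("q", "S"):       ("S", [("qleft", 1), ("q", 0), ("qright", 1)]),
      ("qleft", "VP"):  ("q", 0),          # keep only the first child (the VB)
      ("qright", "VP"): ("q", 1),          # keep only the second child (the NP)
      ("q", "PRO"):     ("PRO", []),
      ("q", "VB"):      ("VB", []),
      ("q", "NP"):      ("NP", []),
  }

  def apply_state(state, tree):
      label, children = tree
      rhs = RULES[(state, label)]
      return build(rhs, children)

  def build(template, children):
      if isinstance(template[1], list):     # an output node: (label, child templates)
          label, kids = template
          return (label, [build(k, children) for k in kids])
      state, idx = template                 # a (state, variable) leaf: recurse into the input
      return apply_state(state, children[idx])

  inp = ("S", [("PRO", []), ("VP", [("VB", []), ("NP", [])])])
  print(apply_state("q", inp))              # ('S', [('VB', []), ('PRO', []), ('NP', [])])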
Extended Left-Hand Side Transducer: xR
• The same re-ordering with a single rule that looks deeper into the input:
  – q S(x0:PRO, VP(x1:VB, x2:NP)) → S(q x1, q x0, q x2)
• Final output: S(VB, PRO, NP)
• Mentioned already in Section 4 of Rounds 1970, but not defined or used in proofs there. See [Knight & Graehl 05].

Tree-to-String Transducers: xRS
• Syntax-directed translation for compilers [Aho & Ullman 71]
• Example rule: q S(x0:NP, VP(x1:VBZ, x2:NP)) → q x0 , q x2 , q x1
• Applied to the English tree for "he enjoys listening to music", the transducer emits a string rather than a tree; over several slides the output becomes "kare, wa, ongaku, o, kiku, no, ga, daisuki, desu"
• Used in a practical MT system [Galley et al 04]
• (A small tree-to-string sketch in code follows below.)

Are Tree Transducers Expressive Enough for Natural Language?
• Can published tree-based probabilistic models be cast in this framework?
  – This was previously done for string-based models, e.g.
    • [Knight & Al-Onaizan 98] – word-based MT as FSTs
    • [Kumar & Byrne 03] – phrase-based MT as FSTs
• Can published tree-based probabilistic models be naturally extended in this framework?
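A tiny tree-to-string (xRS-style) sketch. The rule set below is a simplified stand-in for the slide's English-to-Japanese example (it drops "listening"), and the encoding is an illustrative assumption, not the rule format of any MT system: a left-hand side is a tree pattern whose leaves may be variables "x0", "x1", …, and a right-hand side is a list of output tokens, where ("q", "x1") means "recursively translate the subtree bound to x1".

  RULES = [
      (("S", ["x0", ("VP", ["x1", "x2"])]),   [("q", "x0"), "wa", ("q", "x2"), ("q", "x1")]),
      (("NP", [("PRO", [("he", [])])]),       ["kare"]),
      (("VBZ", [("enjoys", [])]),             ["ga", "daisuki", "desu"]),
      (("NP", [("NN", [("music", [])])]),     ["ongaku", "o"]),
  ]

  def match(pattern, tree, binding):
      if isinstance(pattern, str):            # a variable leaf: bind it
          binding[pattern] = tree
          return True
      if pattern[0] != tree[0] or len(pattern[1]) != len(tree[1]):
          return False
      return all(match(p, t, binding) for p, t in zip(pattern[1], tree[1]))

  def translate(tree):
      for pattern, rhs in RULES:
          binding = {}
          if match(pattern, tree, binding):
              out = []
              for tok in rhs:
                  if isinstance(tok, tuple):  # ("q", "xN"): recurse on the bound subtree
                      out += translate(binding[tok[1]])
                  else:
                      out.append(tok)
              return out
      raise ValueError("no rule matches " + str(tree[0]))

  he    = ("NP", [("PRO", [("he", [])])])
  music = ("NP", [("NN", [("music", [])])])
  tree  = ("S", [he, ("VP", [("VBZ", [("enjoys", [])]), music])])
  print(" ".join(translate(tree)))            # "kare wa ongaku o ga daisuki desu"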
Are Tree Transducers Expressive Enough for Natural Language?
  [figure: the [Yamada & Knight 01] MT model again – Parse (E), Reorder, Insert, Translate, Take Leaves – mapping "he adores listening to music" to the sentence "Kare ha ongaku wo kiku no ga daisuki desu"]
• The whole model can be cast as a single 4-state tree transducer. See [Graehl & Knight 04].

Extensions Possible within the Transducer Framework
• Phrasal translation: VP(VBZ(is), VBG(singing)) → "está, cantando"
• Non-contiguous phrases: a rule mapping "there is … x0:NP" to "hay, x0"
• Non-constituent phrases: VB(put) PRT(on) x0:NP → "poner, x0"
• Context-sensitive word insertion: NPB(DT(the), x0:NNS) → x0
• Multilevel re-ordering: S(x0:NP, VP(x1:VB, x2:NP2)) → x1, x0, x2
• Lexicalized re-ordering: NP(x0:NP, PP(P(of), x1:NP)) → x1, , x0
• A practical MT system [Galley et al 04, 06; Marcu et al 06] uses 40m such automatically collected rules
• (Two of these rules, written in the toy encoding of the earlier tree-to-string sketch, appear below.)

Limitations of the Top-Down Transducer Model
• Example: relate "Who does John think Mary believes I saw?" to "John thinks Mary believes I saw who?"
  [figure: the transducer passes a state such as qwho down through "think … believes … saw" until it reaches the gap, where it re-inserts the WhNP "who" – this works, because "who" is a single known word that the state can remember]
• But for "Whose blue dog does John think Mary believes I saw?" / "John thinks Mary believes I saw whose blue dog?", the moved phrase WhNP(whose, blue, dog) can be arbitrarily large, and a top-down transducer cannot carry it across an unbounded distance in its finite state – can't do this
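For concreteness, two of the extension rules above written in the same toy (pattern, output-token) encoding used in the earlier tree-to-string sketch; this encoding is an assumption of that sketch, not the rule format of [Galley et al 04].

  EXTENSION_RULES = [
      # Phrasal translation:  VP(VBZ(is), VBG(singing)) -> "está cantando"
      ( ("VP", [("VBZ", [("is", [])]), ("VBG", [("singing", [])])]),
        ["está", "cantando"] ),
      # Multilevel re-ordering:  S(x0:NP, VP(x1:VB, x2:NP2)) -> x1 x0 x2
      ( ("S", ["x0", ("VP", ["x1", "x2"])]),
        [("q", "x1"), ("q", "x0"), ("q", "x2")] ),
  ]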
Tree Transducers: A Digression
  [figure, built up over several slides: a hierarchy of transducer classes, with WFST at the bottom and the linear non-deleting classes RLN | FLN just above it, and edges labeled by the capability each step adds ("+ branching LHS rules", "+ delete", "+ copy (before processing)", "+ complex re-order", "+ finite-check before delete", "+ reg-check before delete", "+ nondeterministic processing before copy") leading up through RL, R, F, FL, xRLN, xRL, xR, RT, RD, RR, RTD, R2, F2 and PDTT; bold circles mark the classes closed under composition]

Top-Down Tree Transducers are Not Closed Under Composition [Rounds 70]
• Input: a monadic chain of X's and Y's ending in a
• A first R transducer changes the X's and Y's to X's and Y's nondeterministically – no problem!
• A second R transducer duplicates its input tree under a new root label Z – no problem!
• But no single R transducer computes their composition (nondeterministically relabel, then output Z over two identical copies of the result): R must copy before relabeling, and then the two copies make their nondeterministic choices independently – impossible for R
• NOTE: Deterministic R-transducers are composable.
Properties of Transducer Classes
                                     String Transducers   Tree Transducers
                                     FST                  R        RL       RLN=FLN   F        FL
  Closed under composition           YES                  NO       NO       YES       NO       YES
  Closed under intersection          NO                   NO       NO       NO        NO       NO
  Image of tree (or finite forest)   RSL                  RTL      RTL      RTL       Not RTL  RTL
  Image of RTL                       RSL                  Not RTL  RTL      RTL       Not RTL  RTL
  Inverse image of tree              RSL                  RTL      RTL      RTL       RTL      RTL
  Inverse image of RTL               RSL                  RTL      RTL      RTL       RTL      RTL
  Efficiently trainable              YES                  YES      YES      YES
  (references: Hopcroft & Ullman 79, Gécseg & Steinby 84, Baum et al 70, Graehl & Knight 04)

K-best Trees & Determinization
• String sets (FSA)
  – [Dijkstra 59] gives the best path through an FSA: O(e + n log n)
  – [Eppstein 98] gives the k-best paths through an FSA, including cycles: O(e + n log n + k log k) /* to print k-best lengths */
  – [Mohri & Riley 02] give the k-best unique strings (determinization followed by k-best paths)
• Tree sets (RTG)
  – [Knuth 77] gives the best derivation tree from a CFG (easily adapted to RTG): O(t + n log n) running time
  – [Huang & Chiang 05] give the k-best derivations in an RTG
  – [May & Knight 05] give the k-best unique trees (determinization followed by k-best derivations)

The Training Problem [Graehl & Knight 04]
• Given:
  – an xR transducer with a set of non-deterministic rules (r1…rm)
  – a training corpus of input-tree/output-tree pairs
• Produce:
  – conditional probabilities p1…pm that maximize
      P(output trees | input trees) = Π_{<i,o>} P(o | i)                (product over input/output tree pairs <i, o>)
                                    = Π_{<i,o>} Σ_d P(d, o | i)         (sum over derivations d mapping i to o)
                                    = Π_{<i,o>} Σ_d Π_{r in d} p_r      (product over rules r used in derivation d)

Derivations in R
  [figure, repeated across several slides: a small R transducer with eight weighted rules (rule1 and rule2 with weight 1.0, rule3 with 0.6, rule4 with 0.4, and the leaf rules q F → V [0.9], q F → W [0.1], q G → V [0.5], q G → W [0.5]), plus an input tree and an output tree]
• Two different derivations map the input tree to the same output tree, with total weights 0.27 and 0.02
• All derivations can be stored compactly as a packed derivation forest, whose states pair an input-tree address with an output-tree address:
  – qstart → rule1(q.1.12, q.2.11)        [1.0]
  – q.1.12 → rule2                        [1.0]
  – q.2.11 → rule3(q.21.111, q.22.112)    [0.6]
  – q.2.11 → rule4(q.21.112, q.22.111)    [0.4]
  – q.21.111 → rule5 [0.9];  q.22.112 → rule8 [0.5];  q.21.112 → rule6 [0.1];  q.22.111 → rule7 [0.5]
• (A small sketch of computing the total weight packed into such a forest follows below.)
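To make "packed derivations" concrete, here is a minimal sketch of the total derivation weight Σ_d P(d, o | i) summed over the forest above. The dictionary encoding mirrors the slide's forest; it is an illustrative assumption, not the data structure of any implementation.

  # A forest maps each state to its alternative expansions, each (rule weight, child states).
  FOREST = {
      "qstart":   [ (1.0, ["q.1.12", "q.2.11"]) ],       # rule1
      "q.1.12":   [ (1.0, []) ],                          # rule2
      "q.2.11":   [ (0.6, ["q.21.111", "q.22.112"]),      # rule3
                    (0.4, ["q.21.112", "q.22.111"]) ],    # rule4
      "q.21.111": [ (0.9, []) ],                          # rule5
      "q.22.112": [ (0.5, []) ],                          # rule8
      "q.21.112": [ (0.1, []) ],                          # rule6
      "q.22.111": [ (0.5, []) ],                          # rule7
  }

  def prod(xs):
      p = 1.0
      for x in xs:
          p *= x
      return p

  def inside(state, memo=None):
      """Sum of the weights of all derivations rooted at 'state' (memoized)."""
      memo = {} if memo is None else memo
      if state not in memo:
          memo[state] = sum(w * prod(inside(c, memo) for c in children)
                            for w, children in FOREST[state])
      return memo[state]

  print(inside("qstart"))   # 0.6*0.9*0.5 + 0.4*0.1*0.5 = 0.27 + 0.02 = 0.29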
Naïve EM Algorithm for xR
  Initialize uniform probabilities
  Until tired do:
    Make a zero rule-count table
    For each training case <i, o>:
      Compute P(o | i) by summing over derivations d
      For each derivation d for <i, o>:              /* exponential! */
        Compute P(d | i, o) = P(d, o | i) / P(o | i)
        For each rule r used in d:
          count(r) += P(d | i, o)
    Normalize counts to probabilities

Efficient EM for xR
  Initialize uniform probabilities
  Until tired do:
    Make a zero rule-count table
    For each training case <i, o>:
      Build the packed derivation forest              O(q n^2 r) time/space
      Use the inside-outside algorithm to collect rule counts
        Inside pass                                   O(q n^2 r) time/space
        Outside pass                                  O(q n^2 r) time/space
        Count collection pass                         O(q n^2 r) time/space
    Normalize counts to probabilities (joint or conditional)
• Per-example training complexity is O(n^2) × "transducer constant" – the same as forward-backward training for string transducers [Baum & Welch 71].
• The variation for xRS runs in O(q n^4 r) time (if |RHS| = 2).
• (A small inside-outside count-collection sketch appears near the end of these slides.)

Training: Related Work
• Generalizes the specific MT model training algorithm in the appendix of [Yamada & Knight 01]
  – xR training is not tied to a particular MT model, or to MT at all
• Generalizes forward-backward HMM training [Baum et al 70]
  – Write strings vertically, and use xR training
• Generalizes inside-outside PCFG training [Lari & Young 90]
  – Attach a fixed input tree to each "output" string, and use xRS training
• Generalizes synchronous tree-substitution grammar (STSG) training [Eisner 03]
  – xR and xRS allow copying of subtrees

Picture So Far
                 Strings                             Trees
  Grammars       Regular grammar, CFG                DONE (RTG)
  Acceptors      Finite-state acceptor (FSA), PDA    DONE (TDTA)
                 String Pairs (input→output)         Tree Pairs (input→output)
  Transducers    Finite-state transducer (FST)       DONE

If You Want to Play Around …
• Tiburon: A Weighted Tree Automata Toolkit
  – www.isi.edu/licensed-sw/tiburon
  – Developed by Jonathan May at USC/ISI
  – Portable, written in Java
• Implements operations on tree grammars and tree transducers
  – k-best trees, determinization, application, training, …
• A tutorial primer comes with explanations, sample automata, and exercises
  – Hands-on follow-up to a lot of what we covered today
• Beta version released June 2006

Outline
• Part 1: Introduction
• Part 2: Synchronous Grammars
• < break >
• Part 3: Tree Automata
• Part 4: Conclusion

Why Worry About Formal Languages?
• They already figured out most of the key things back in the 1960s & 1970s
• Lucky us!

Relation of Synchronous Grammars with Tree Transducers
• Developed largely independently
  – Synchronous grammars
    • [Aho & Ullman 69, Lewis & Stearns 68] inspired by compilers
    • [Shieber & Schabes 90] inspired by the syntax/semantics mapping
  – Tree transducers
    • [Rounds 70] inspired by transformational grammar
• Recent work is starting to relate the two
  – [Shieber 04, 06] explains both in terms of abstract bimorphisms
  – [Graehl, Hopkins, Knight in prep] relates STSG to xRLN & gives composition closure results

Wait, Maybe They Didn't Figure Out Everything in the 1960s and 1970s …
• Is there a formal tree transformation model that fits natural language problems and is closed under composition?
• Is there a "markovized" version of R-style transducers, with horizontal Thatcher-like processing of children?
• Are there transducers that can move material across unbounded distance in a tree? Are they trainable?
• Are there automata models that can work with string/string data?
• Can we build large, accurate, probabilistic syntax-based language models of English using RTGs?
• What about root-less transformations like JJ(big) NN(cheese) → jefe?
• Can we build probabilistic tree toolkits as powerful and ubiquitous as string toolkits?
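The inside-outside count-collection sketch promised on the Efficient EM slide, in miniature. It restates the toy packed forest from the earlier sketch (with rule names added) so that this block runs on its own; the encoding, the assumption that dictionary order is a valid top-down order, and the weights are all illustrative, not the [Graehl & Knight 04] implementation.

  FOREST = {   # state -> alternatives; each alternative = (rule name, weight, child states)
      "qstart":   [("rule1", 1.0, ["q.1.12", "q.2.11"])],
      "q.1.12":   [("rule2", 1.0, [])],
      "q.2.11":   [("rule3", 0.6, ["q.21.111", "q.22.112"]),
                   ("rule4", 0.4, ["q.21.112", "q.22.111"])],
      "q.21.111": [("rule5", 0.9, [])], "q.22.112": [("rule8", 0.5, [])],
      "q.21.112": [("rule6", 0.1, [])], "q.22.111": [("rule7", 0.5, [])],
  }

  def prod(xs):
      p = 1.0
      for x in xs:
          p *= x
      return p

  _inside = {}
  def inside(s):
      if s not in _inside:
          _inside[s] = sum(w * prod(inside(c) for c in kids) for _, w, kids in FOREST[s])
      return _inside[s]

  beta = {s: inside(s) for s in FOREST}                  # inside pass
  alpha = {s: 0.0 for s in FOREST}; alpha["qstart"] = 1.0
  counts = {}
  for s in FOREST:                                       # outside + count passes
      for rule, w, kids in FOREST[s]:                    # (dict order is top-down in this toy)
          edge = alpha[s] * w * prod(beta[c] for c in kids)
          counts[rule] = counts.get(rule, 0.0) + edge / beta["qstart"]
          for k, c in enumerate(kids):                   # pass outside weight to each child
              alpha[c] += alpha[s] * w * prod(beta[d] for j, d in enumerate(kids) if j != k)

  for rule in sorted(counts):
      print(rule, round(counts[rule], 3))
  # rule3 gets expected count 0.27/0.29 ≈ 0.931, rule4 gets 0.02/0.29 ≈ 0.069, etc.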
More To Do on Synchronous Grammars
• Connections between synchronous grammars and tree transducers, and transfer of results between them
• Degree of formal power needed in theory and practice for translation and paraphrase [Wellington et al 06]
• New kinds of synchronous grammars that are more powerful but not more expensive, or not much more
• Efficient and optimal algorithms for minimizing the rank of a synchronous CFG [Gildea, Satta, Zhang 06]
• Faster strategies for translation with an n-gram language model [Zollmann & Venugopal 06; Chiang, to appear?]

Further Readings
• Synchronous grammars
  – Attached notes by David (with references)
• Tree automata
  – [Knight & Graehl 05] overview and relationship to natural language processing
  – Tree automata textbook [Gécseg & Steinby 84]
  – TATA textbook on the Internet [Comon et al 97]

References on These Slides
[Aho & Ullman 69] ACM STOC
[Aho & Ullman 71] Information and Control, 19
[Bangalore & Rambow 00] COLING
[Baum et al 70] Ann. Math. Stat. 41
[Chiang 05] ACL
[Chomsky 57] Syntactic Structures
[Church 88] ANLP
[Comon et al 97] www.grappa.univ-lille3.fr/tata/
[Dijkstra 59] Numer. Math. 1
[Doner 70] J. Computer & Sys. Sci. 4
[Echihabi & Marcu 03] ACL
[Eisner 03] HLT
[Eppstein 98] SIAM J. Computing 28
[Galley et al 04, 06] HLT, ACL-COLING
[Gécseg & Steinby 84] textbook
[Gildea, Satta, Zhang 06] COLING-ACL
[Graehl & Knight 04] HLT
[Jelinek 90] Readings in Speech Recognition
[Knight & Al-Onaizan 98] AMTA
[Knight & Graehl 05] CICLing
[Knight & Marcu 00] AAAI
[Knuth 77] Info. Proc. Letters 6
[Kumar & Byrne 03] HLT
[Lewis & Stearns 68] JACM
[Lari & Young 90] Comp. Speech and Lang. 4
[Marcu et al 06] EMNLP
[May & Knight 05] HLT
[Melamed 03] HLT
[Mohri & Riley 02] ICSLP
[Mohri, Pereira, Riley 00] Theo. Comp. Sci.
[Pang et al 03] HLT
[Rabin 69] Trans. Am. Math. Soc. 141
[Rounds 70] Math. Syst. Theory 4
[Shieber & Schabes 90] COLING
[Shieber 04, 06] TAG+, EACL
[Thatcher 67] J. Computer & Sys. Sci. 1
[Thatcher 70] J. Computer & Sys. Sci. 4
[Thatcher & Wright 68] Math. Syst. Theory 2
[Wellington et al 06] ACL-COLING
[Wu 97] Comp. Linguistics
[Yamada & Knight 01, 02] ACL
[Zollmann & Venugopal 06] HLT

Thank you