Shuffling Non-Constituents
Jason Eisner, with David A. Smith (syntactically-flavored reordering model) and Roy Tromble (syntactically-flavored reordering search methods)
ACL SSST Workshop (invited talk), June 2008

Starting point: Synchronous alignment
- Synchronous grammars are very pretty. But does parallel text actually have parallel structure?
- Depends on what kind of parallel text: Free translations? Noisy translations? Were the parsers trained on parallel annotation schemes?
- Depends on what kind of parallel structure: What kinds of divergences can your synchronous grammar formalism capture? E.g., wh-movement versus wh in situ.

Synchronous Tree Substitution Grammar
- Two training trees, showing a free translation from French to English.
[Figure: a French dependency tree for "beaucoup d'enfants donnent un baiser à Sam" ("lots of kids give a kiss to Sam") paired with an English tree for "kids kiss Sam quite often".]
- A possible alignment is shown in orange: enfants-kids, baiser-kiss, Sam-Sam, beaucoup-often/quite, with some function words aligned to null.
- A much worse alignment is also possible.
Synchronous Grammar = Set of Elementary Trees
[Figure: the aligned tree pair cut into elementary tree pairs, e.g., donnent ... un baiser à ("give a kiss to") paired with kiss, beaucoup d' ("lots of") with often/quite, enfants with kids, Sam with Sam, plus null alignments.]

But many examples are harder
- German: "Auf diese Frage habe ich leider keine Antwort bekommen" (gloss: to this question have I alas no answer received).
- English: "I did not unfortunately receive an answer to this question."
[Figure: aligned dependency trees for this sentence pair, highlighting three divergences.]
- Displaced modifier (negation): German "keine" ("no") vs. English "not".
- Displaced argument (here, because of a projective parser).
- Head-swapping (here, just different annotation conventions).

Free Translation
- German: "Tschernobyl könnte dann etwas später an die Reihe kommen" (gloss: Chernobyl could then something later on the queue come).
- English: "Then we could deal with Chernobyl some time later."
- Probably not systematic (but the words are correctly aligned).
- Also an erroneous parse.

What to do?
Current practice:
- Don't try to model all systematic phenomena! Just use non-syntactic alignments (Giza++).
- Only care about the fragments that recur often: phrases or gappy phrases, sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008).
- Use these (gappy) phrases in a decoder, phrase-based or hierarchical.

What to do?
- But could syntax give us better alignments? It would have to be "loose" syntax ...
- Why do we want better alignments?
  1. Throw away less of the parallel training data.
  2. Help learn a smarter, syntactic reordering model (could help decoding: less reliance on the LM).
  3. Some applications care about full alignments.
Quasi-synchronous grammar
- How do we handle "loose" syntax? Translation story: generate target English by a monolingual grammar.
- Any grammar formalism is okay; pick a dependency grammar formalism for now.
- E.g., P(PRP | no previous left children of "did") chooses a child tag, and P(I | did, PRP) chooses the word.
- Parsing: O(n³).

Quasi-synchronous grammar
- But probabilities are influenced by the source sentence.
- Each English node is aligned to some source node, and the model prefers to generate children aligned to source nodes near the parent's aligned node.

QCFG Generative Story
- Observed source: "Auf diese Frage habe ich leider keine Antwort bekommen" (plus NULL).
- Each step now also conditions on the aligned source word, e.g., P(PRP | no previous left children of "did", habe) and P(I | did, PRP, ich), with P(parent-child) vs. P(breakage) factors for how the child's alignment relates to the parent's.
- Aligned parsing: O(m²n³).

What's a "nearby node"?
- Given the parent's alignment, where might the child be aligned?
- The parent-child configuration is the synchronous grammar case; other configurations, plus "none of the above," are breakages.

Quasi-synchronous grammar: useful analogies
1. Generative grammar with latent word senses.
2. MEMM: generate an n-gram tag sequence (target), but probabilities are influenced by the word sequence (source).
3. IBM Model 1: source nodes can be freely reused or unused.
- Future work: enforce 1-to-1 alignment to allow good decoding (NP-hard to do exactly).
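A minimal sketch of the scoring idea behind this generative story, in Python. Everything here (the P_CONFIG table and its values, the function names, the data encodings) is hypothetical illustration, not the authors' code: the point is only that each target head-to-child step pays a monolingual factor times a factor for how the source nodes aligned to head and child relate, with parent-child as the synchronous case and everything else a "breakage".

```python
import math

# Hypothetical configuration probabilities: "parent-child" is the
# synchronous-grammar case; the others are increasingly loose breakages.
P_CONFIG = {"parent-child": 0.6, "ancestor": 0.15, "sibling": 0.15, "other": 0.1}

def config(src_parent, a_head, a_child):
    """Classify how the source nodes aligned to a target head and child
    relate in the source tree. src_parent maps each source node to its
    parent (None at the root). Real QG distinguishes more cases."""
    if a_head is None or a_child is None:
        return "other"                      # aligned to NULL
    if src_parent[a_child] == a_head:
        return "parent-child"
    node = src_parent[a_child]
    while node is not None:                 # is a_head a proper ancestor?
        if node == a_head:
            return "ancestor"
        node = src_parent[node]
    if src_parent[a_child] == src_parent[a_head]:
        return "sibling"
    return "other"

def log_score(tgt_deps, align, src_parent, p_mono):
    """Monolingual dependency log-probs plus alignment-configuration
    log-probs, summed over the target tree's (head, child) edges."""
    total = 0.0
    for head, child in tgt_deps:
        total += math.log(p_mono(head, child))       # monolingual syntax
        cfg = config(src_parent, align.get(head), align.get(child))
        total += math.log(P_CONFIG[cfg])             # source-side influence
    return total
```

In the actual dynamic program, the aligned source node becomes part of each chart item's signature, which is where the O(m²n³) aligned-parsing runtime on the slide comes from.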
Some results: Quasi-synchronous dependency grammar
- Alignment (D. Smith & Eisner 2006): quasi-synchronous syntax is much better than synchronous syntax, and maybe also better than IBM Model 4.
- Question answering (Wang et al. 2007): align the question with a potential answer. Mean average precision: 43% for the previous state of the art, 48% adding QG, 60% adding QG plus lexical features.
- Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing): learn how parsed parallel text influences target dependencies, along with many other features (cf. co-training). Unsupervised accuracy: German 30% -> 69%, Spanish 26% -> 65%.

Summary of part I
- Current practice: use non-syntactic alignments (Giza++); some bits align nicely; use the frequent bits in a decoder.
- Suggestion: let syntax influence alignments.
- So far, loose syntax methods are like IBM Model 1. It is NP-hard to enforce 1-to-1 alignment in any interesting model.
- Rest of talk: How do we enforce 1-to-1 in interesting models? Can we do something smarter than beam search?

Part II: Shuffling Non-Constituents

Motivation
- MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
- Just have to fix that pesky word order first.

Permutation search in MT
- Reorder French "Marie ne m' a pas vu" (tags NNP NEG PRP AUX NEG VBN; initial order 1 2 3 4 5 6) into the best order 1 4 2 5 6 3, i.e., French' "Marie a ne pas vu m'", from which "Mary hasn't seen me" is an easy transduction.
- Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.

Often want to find an optimal permutation ...
- Machine translation: reorder French to French-prime (Brown et al. 1992), so it's easier to align or translate.
- MT eval: how much do you need to rearrange MT output so that it scores well under an LM derived from reference translations?
- Discourse generation, e.g., multi-doc summarization: order the output sentences so they flow nicely (Lapata 2003).
- Reconstruct the temporal order of events after information extraction.
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under an LM.

Permutation search: The problem
- Initial order 1 2 3 4 5 6; best order 1 4 2 5 6 3 according to some cost function.
- How can we find this needle in the haystack of N! possible permutations?

Traditional approach: Beam search
- Approximate the best path through a really big FSA: N! paths, one for each permutation, but only 2^N states.
- A state remembers what we've generated so far (but not in what order).
- An arc weight is, e.g., the cost of picking 5 next if we've seen {1,2,4} so far.

An alternative: Local search ("hill climbing")
- The SWAP neighborhood: from 123456 (cost 22), the neighbors are 213456 (cost 26), 132456 (cost 20), 124356 (cost 19), 123546 (cost 25), ...
- Move to the best neighbor and repeat: 123456 (22) -> 124356 (19) -> (17) -> (16) -> ... Like the "greedy decoder" of Germann et al. 2001.
- Why are the costs always going down? Because we pick the best swap.
- How long does it take to pick the best swap? O(N) if you're careful.
- How many swaps might you need to reach the answer? O(N²).
- What if you get stuck in a local min? Random restarts.
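A toy sketch of this SWAP hill-climbing loop under a LOP-style cost (a pairwise matrix B where B[i][j] is the cost of word i preceding word j; the LOP costs themselves are defined later in the talk). The matrix values and function names are invented for illustration; the O(1)-per-neighbor delta is what makes "O(N) if you're careful" work.

```python
def lop_cost(pi, B):
    """Total cost: sum of B[p][q] over all pairs p ordered before q."""
    return sum(B[pi[a]][pi[b]]
               for a in range(len(pi)) for b in range(a + 1, len(pi)))

def swap_delta(pi, a, B):
    """Cost change from swapping adjacent positions a and a+1.
    Only that one pair's relative order changes, so this is O(1)."""
    p, q = pi[a], pi[a + 1]
    return B[q][p] - B[p][q]

def hill_climb(pi, B):
    pi = list(pi)
    while len(pi) > 1:
        best = min(range(len(pi) - 1), key=lambda a: swap_delta(pi, a, B))
        if swap_delta(pi, best, B) >= 0:
            break                          # local minimum under SWAP
        pi[best], pi[best + 1] = pi[best + 1], pi[best]
    return pi

B = [[0, -3, 2], [3, 0, 1], [-2, -1, 0]]   # made-up 3x3 precedence costs
print(hill_climb([0, 1, 2], B))            # -> [2, 0, 1] on this toy B
```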
Larger neighborhood (well-known in the literature; works well)
- The INSERT neighborhood: remove one word and reinsert it anywhere, e.g., 1 2 3 4 5 6 (cost 22) -> (cost 17).
- Fewer local minima? Yes: 3 can move past 4 to get past 5.
- Graph diameter (max #moves needed)? O(N) rather than O(N²).
- How many neighbors? O(N²) rather than O(N).
- How long to find the best neighbor? O(N²) rather than O(N).

Even larger neighborhood
- The BLOCK neighborhood: exchange two adjacent blocks, e.g., 1 2 3 4 5 6 (cost 22) -> (cost 14).
- Fewer local minima? Yes: 2 can get past 45 without having to cross 3 or move 3 first.
- Graph diameter? Still O(N).
- How many neighbors? O(N³) rather than O(N) or O(N²).
- How long to find the best neighbor? O(N³) rather than O(N) or O(N²).

Larger yet: via dynamic programming??
- Fewer local minima? Yes.
- Graph diameter (max #moves needed)? Logarithmic.
- How many neighbors? Exponential.
- How long to find the best neighbor? Polynomial.

Unifying/generalizing the neighborhoods so far
- A move exchanges two adjacent blocks, of max widths w <= w', and is defined by an (i,j,k) triple: exchange the blocks spanning positions i..j-1 and j..k-1.
- SWAP: w=1, w'=1. INSERT: w=1, w'=N. BLOCK: w=N, w'=N.
- Runtime = # neighbors = O(ww'N): O(N), O(N²), O(N³) respectively.
- Everything in this talk can be generalized to other values of w, w'.

Very large-scale neighborhoods
- What if we consider multiple simultaneous exchanges that are "independent"? E.g., 1 3 2 5 4 6 applies two disjoint swaps to 1 2 3 4 5 6 at once.
- The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000): build a lattice with one arc for "keep this word" and one for "swap this adjacent pair" (the arc cost is the Δcost of that swap, e.g., of swapping (4,5), here < 0). The lowest-cost neighbor is the lowest-cost path.
- Why would this be a good idea?
- Help get out of bad local minima? No; they're still local minima.
- Help avoid getting into bad local minima? Yes: it is less greedy. (In the B-matrix example on the slide, greedy SWAP takes the single -30 swap, while DYNASEARCH takes two compatible swaps for -20 + -20.)
- More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves as fast as 1 move: no penalty for "parallelism"! It globally optimizes over exponentially many neighbors (paths).

Can we extend this idea (up to N moves in parallel, by dynamic programming) to neighborhoods beyond SWAP?
- Yes, and the asymptotic runtime is always unchanged: O(N), O(N²), O(N³) for the SWAP-, INSERT-, and BLOCK-style restrictions. (See the sketch below for the SWAP case.)
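A sketch of the DYNASEARCH idea for the SWAP case, reusing the LOP-style B matrix from the earlier sketch: a left-to-right dynamic program (equivalently, a shortest path through the lattice just described) picks the best set of non-overlapping adjacent swaps in O(N), rather than the single best swap.

```python
def dynasearch_step(pi, B):
    """One very-large-scale SWAP move: apply the best compatible set of
    adjacent swaps. best[i] = best total cost change for prefix pi[:i]."""
    n = len(pi)
    best = [0.0] * (n + 1)
    take = [False] * (n + 1)       # take[i]: swap positions i-2, i-1?
    for i in range(2, n + 1):
        best[i] = best[i - 1]                        # leave pi[i-1] in place
        p, q = pi[i - 2], pi[i - 1]
        with_swap = best[i - 2] + (B[q][p] - B[p][q])
        if with_swap < best[i]:
            best[i], take[i] = with_swap, True
    pi, i = list(pi), n                              # trace back the choices
    while i >= 2:
        if take[i]:
            pi[i - 2], pi[i - 1] = pi[i - 1], pi[i - 2]
            i -= 2
        else:
            i -= 1
    return pi, best[n]             # new permutation, total improvement
```

Iterating dynasearch_step until the improvement reaches 0 gives the less-greedy behavior described above, at the same O(N) cost per step as finding a single best swap.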
Let's define each neighbor by a "colored tree": just like ITG!
- A colored node means: swap its children.
[Figure: binary trees over 1 2 3 4 5 6 with some nodes colored; e.g., coloring the root swaps two blocks to give 5 6 1 2 3 4.]
- This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.

If that was the optimal neighbor ...
- ... now look for its optimal neighbor: a new tree!
- Repeat till we reach a local optimum (arriving, e.g., at 1 4 2 5 6 3).
- Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing). Use your favorite parsing speedups (pruning, best-first, ...).

Very-large-scale versions of SWAP, INSERT, and BLOCK, all by the algorithm we just saw ...
- The runtime of the algorithm we just saw was O(N³) because we considered O(N³) distinct (i,j,k) triples.
- More generally, restrict to only the O(ww'N) triples of interest to define a smaller neighborhood with runtime O(ww'N). (Yes, the dynamic programming recurrences go through.)

How many steps to get from here to there?
- From initial order 6 2 5 8 4 3 7 1 to best order 1 2 3 4 5 6 7 8: can one twisted-tree step do it? No: as you probably know, turning 3 1 4 2 into 1 2 3 4 in one step is impossible.

Can you get to the answer in one step?
- Not always (yay, local search). Often (yay, big neighborhood). For longer sentences, usually not.
[Figure: one-step reachability by sentence length; German-English, Giza++ alignment.]

How many steps to the answer in the worst case? (What is the diameter of the search space?)
- Claim: only log₂ N steps at worst (if you know where to step). Let's sketch the proof!
- You can quicksort anything into, e.g., a right-branching tree: a single colored-tree step can partition every current block around a pivot, all in parallel, since the exchanges are nested.
- So only log₂ N steps (a sequence of right-branching trees) to get to 1 2 3 4 5 6 7 8 ... or to anywhere!

Defining "best order"
- What class of cost functions can we handle efficiently? How fast can we compute a subtree's cost from its child subtrees?
- TSP-style costs, matrix A: the order 1 4 2 5 6 3 pays a14 + a42 + a25 + a56 + a63 + a31, the cost of each adjacent pair in the output (plus wraparound).
- LOP-style costs, matrix B: b26 = the cost of 2 preceding 6; add up the n(n-1)/2 such costs. (Any order will incur either b26 or b62.)
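Both cost classes are easy to state in code. A toy sketch (names invented; lop_cost was sketched earlier):

```python
def tsp_cost(pi, A):
    """TSP-style cost: A[p][q] for each adjacent pair in the output order,
    including the wraparound term (the a31 on the slide)."""
    n = len(pi)
    return sum(A[pi[a]][pi[(a + 1) % n]] for a in range(n))
```

The question the slides now take up is how such costs interact with the tree-structured search: adjacent-pair (TSP/WFSA-style) costs change only at block boundaries, while the pairwise LOP costs need the reuse trick shown later.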
Defining "best order": what class of cost functions?
- TSP and LOP are both NP-complete; in fact, they are believed to be inapproximable (hard even to achieve C × the optimal cost, for any C >= 1).
- Practical approaches: branch-and-bound, ILP, ... (correct answer, typically fast); beam search, this talk, ... (fast answer, typically close to correct).

Defining "best order": what class of cost functions?
The cost of the order 1 4 2 5 6 3 can add up all of:
1. Does my favorite WFSA like this string of numbers? (Generalizes TSP.)
2. Are non-local pair orders okay (4 before 3 ...)? (LOP.)
3. Are non-local triple orders okay (1 ... 2 ... 3)?

Costs are derived from source sentence features
- French "Marie ne m' a pas vu" (tags NNP NEG PRP AUX NEG VBN, positions 1-6): e.g., "ne" would like to be brought adjacent to the next NEG word.
- A single B entry sums weighted feature costs, e.g., 50 (a verb such as "vu" shouldn't precede its subject, such as "Marie") + 27 (words at this distance shouldn't swap order) - 2 (words with a PRP between them ought to swap) + ... = 75.
[6x6 matrices A and B of example costs omitted.]
- Can also include phrase boundary symbols in the input!
- FSA costs: a distortion model; a language model that looks ahead to the next step (would this be a good finite-state translation into good English?).

Dynamic program must pick the tree that leads to the lowest-cost permutation
1. Does my favorite WFSA like it as a string?

Scoring with a weighted FSA
- One particular WFSA implements TSP scoring for N=3: after you read 1, you're in state 1; after you read 2, you're in state 2; after you read 3, you're in state 3; and this state determines the cost of the next symbol you read.
- We'll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now the runtime goes up to O(N³Q³) ...)

Including WFSA costs via nonterminals
- A possible preterminal for word 2 is an arc in the WFSA that's labeled with 2: the preterminal "4→2" rewrites as word 2, with a cost equal to that arc's cost. E.g., reading the output 5 6 1 4 2 3 from initial state I uses arcs I→5, 5→6, 6→1, 1→4, 4→2, 2→3.
- Each constituent's nonterminal records the WFSA states where its span starts and ends, and its total cost includes the cost of the best WFSA path through the span; the root's cost is the cost of the new permutation.
[Figure: a CKY chart whose preterminals are state-labeled arcs and whose larger constituents pair up start/end states.]
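A rough sketch of the state-decorated chart items this slide implies (a hypothetical Item type, not the talk's implementation): each item carries the WFSA states where its span's best path enters and leaves, and two adjacent items combine, straight or inverted, only if their paths connect at a shared middle state. With Q states there are Q² decorations per span and Q³ state combinations per split point, giving the O(N³Q³) runtime.

```python
from collections import namedtuple

# lo, hi: span of original positions covered; s_in, s_out: WFSA states at
# the ends of the span's best path; words: the chosen output order.
Item = namedtuple("Item", "lo hi s_in s_out cost words")

def combine(left, right, swap):
    """Combine two adjacent spans, straight (swap=False) or inverted
    (swap=True, the 'colored node' case). Returns None if the WFSA
    paths don't connect."""
    first, second = (right, left) if swap else (left, right)
    if first.s_out != second.s_in:
        return None
    return Item(left.lo, right.hi, first.s_in, second.s_out,
                first.cost + second.cost,
                first.words + second.words)
```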
Dynamic program must pick the tree that leads to the lowest-cost permutation
1. Does my favorite WFSA like it as a string?
2. Are non-local pair orders okay (4 before 3 ...)?

Incorporating the pairwise ordering costs
- Exchanging two blocks puts {5,6,7} before {1,2,3,4}, so this hypothesis must add the costs for 5<1, 5<2, 5<3, 5<4, 6<1, 6<2, 6<3, 6<4, 7<1, 7<2, 7<3, 7<4.
- Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time? Nope: dynamic programming to the rescue again!

Computing LOP cost of a block move
- Naively we'd have to add O(N²) costs just to consider this single neighbor.
- Instead, reuse work from other, "narrower" block moves: the cost of putting {5,6,7} before {1,2,3,4} decomposes into the costs of narrower exchanges (e.g., {5,6} before {1,2,3,4}, plus {7} before {1,2,3,4}), which were already computed at earlier steps of parsing. So the new cost comes in O(1).

Incorporating 3-way ordering costs
- See the initial paper (Eisner & Tromble 2006).
- A little tricky, but it comes "for free" if you're willing to accept a certain restriction on these costs; it's more expensive without that restriction, but possible.
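One concrete way to get the O(1) combination the slide describes is to precompute 2D prefix sums over B. This is a sketch under the simplifying assumption that each block is a contiguous range of original positions; the parser's actual recurrence instead reuses its own narrower chart items, as described above.

```python
def prefix_sums(B):
    """S[p][q] = sum of B[x][y] for x < p, y < q."""
    n = len(B)
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    for p in range(n):
        for q in range(n):
            S[p + 1][q + 1] = B[p][q] + S[p][q + 1] + S[p + 1][q] - S[p][q]
    return S

def rect(S, p0, p1, q0, q1):
    """Sum of B[p][q] over p in [p0,p1), q in [q0,q1), in O(1)."""
    return S[p1][q1] - S[p0][q1] - S[p1][q0] + S[p0][q0]

def exchange_delta(S, i, j, k):
    """Cost change of exchanging adjacent blocks [i,j) and [j,k):
    each pair across the boundary flips from B[p][q] to B[q][p]."""
    return rect(S, j, k, i, j) - rect(S, i, j, j, k)
```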
Another option: Markov chain Monte Carlo
- Random walk in the space of permutations: interpret a permutation's cost as a log-probability, p(π) = exp(-cost(π)) / Z.
- Sample a permutation from the neighborhood instead of always picking the most probable.
- Why? Simulated annealing might beat greedy-with-random-restarts. And when learning the parameters of the distribution, we can use sampling to compute the feature expectations.
- How? Pitfall: sampling a permutation ≠ sampling a tree. Spurious ambiguity: some permutations have many trees. Solution: exclude some trees, leaving 1 per permutation. A normal form has long been known for colored trees; for restricted colored trees (which limit the size of the blocks to swap), we've devised a more complicated normal form.

Learning the costs
- Where do these costs come from? If we have some examples on which we know the true permutation, we could try to learn them.
- More precisely, we try to learn the weights θ (the knowledge that's reused across examples) behind feature costs like "50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)".

Experimenting with training LOP params
- (LOP is quite fast: O(n³) with no grammar constant.)
- Example: "Das kann ich so aus dem Stand nicht sagen ." (tags PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF $.), with pairwise costs such as B[7,9].

Feature templates for cost of swapping i, j
- 22 feature templates, plus versions of all of these conjoined with the distance j - i (binned).
- Only LOP features so far, and they're unnecessarily simple (they don't examine syntactic constituency), and the input sequence is only words (not interspersed with syntactic brackets).

Learning LOP costs for MT
- (Interesting, if odd, to try to reorder with only the LOP costs.)
- Pipeline: German -> LOP reordering -> German' -> MOSES -> English; the baseline is MOSES on the original German.
- Define German' to be German in English word order. To get German' for the training data, use Giza++ to align all German positions to English positions (disallowing NULL).

Easy first try: Naïve Bayes
- Treat each feature in θ as independent; count and normalize over the training data.
- No real improvement over the baseline.

Easy second try: Perceptron
[Figure: local search walks from π0 through π1, ..., to a local optimum πn; an error update pushes toward the gold standard π*, which may differ from the global optimum.]
- Note: search error can be beneficial; e.g., just take 1 step from the identity permutation. (A sketch of the update follows.)
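A minimal sketch of that perceptron update for the LOP weights θ; the feature interface and names here are hypothetical. Since costs are θ · features, we lower the cost of every pairwise ordering in the gold permutation and raise it for the search's (possibly only locally optimal) output:

```python
def perceptron_epoch(data, theta, features, search, lr=1.0):
    """data: (sentence, gold permutation) pairs. search: a decoder that
    returns the best permutation found under the current theta.
    features(sentence, i, j) yields the feature names that fire when
    word i precedes word j."""
    for sentence, gold in data:
        pred = search(sentence, theta)
        if list(pred) == list(gold):
            continue
        # Lower the gold order's cost (-1), raise the prediction's (+1).
        for order, sign in ((gold, -1.0), (pred, +1.0)):
            for a in range(len(order)):
                for b in range(a + 1, len(order)):
                    i, j = order[a], order[b]      # i precedes j here
                    for f in features(sentence, i, j):
                        theta[f] = theta.get(f, 0.0) + sign * lr
    return theta
```

Note that a pair ordered the same way in gold and prediction receives both a -1 and a +1, so only the pairs the two orders disagree on actually move θ.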
Benefit from reordering

  Learning method             BLEU vs. German'   BLEU vs. English
  No reordering               49.65              25.55
  Naïve Bayes (POS)           49.21
  Naïve Bayes (POS+lexical)   49.75
  Perceptron (POS)            50.05              25.92
  Perceptron (POS+lexical)    51.30              26.34

- Obviously, this is not yet unscrambling German: we need more features.

Alternatively, work back from the gold standard
- Contrastive estimation (Smith & Eisner 2005): maximize the probability of the desired permutation relative to its 1-step very-large-scale (ITG) neighborhood.
- This requires summing over all permutations in the neighborhood, so we must use normal-form trees here. Train by stochastic gradient descent.

Alternatively, work back from the gold standard
- k-best MIRA in the neighborhood: make the gold standard beat its local competitors (the current winners in its 1-step very-large-scale neighborhood), and beat the bad ones by a bigger margin.
- Good = close to gold in swap distance? Good = close to gold using BLEU? Good = translates into English that's close to the reference?

Alternatively, train each iterate
- At each local-search iterate π(i), update the model so that the oracle permutation in the neighborhood of π(i) beats the model's current best there.
- Or could do a k-best MIRA version of this, too; could even use a loss measure based on lookahead to π(n).

Summary of part II
- Local search is fun and easy: probably useful for translation, popular elsewhere in AI, closely related to MCMC sampling, and maybe useful for other NP-hard problems too.
- We can efficiently use huge local neighborhoods.
- The algorithms are closely related to parsing and FSMs. Our community knows that stuff better than anyone!