Dependency Parsing by Belief Propagation
David A. Smith (JHU, now UMass Amherst)
Jason Eisner (Johns Hopkins University)

Outline
- Edge-factored parsing [old]: dependency parses; scoring the competing parses with edge features; finding the best parse
- Higher-order parsing [new!]: throwing in more features with graphical models; finding the best parse with belief propagation; experiments
- Conclusions

Word Dependency Parsing (slide adapted from Yuji Matsumoto)
- Raw sentence: "He reckons the current account deficit will narrow to only 1.8 billion in September."
- Part-of-speech tagging yields the POS-tagged sentence: He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
- Word dependency parsing then yields a dependency-parsed sentence, with labeled links such as SUBJ, MOD, SPEC, COMP, S-COMP, and ROOT.

What does parsing have to do with belief propagation? Loopy belief propagation.

Outline (section divider; next: scoring the competing parses with edge features)

Great ideas in NLP: log-linear models (Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models: p(A) * p(B|A) * p(C|A,B) * p(D|A,B,C) * ..., where each choice depends on a limited part of the history. But which dependencies should we allow? Should it be p(D|A,B,C), or p(D|A,B) * p(C|A,B,D), or something else? What if they're all worthwhile, given limited training data?
- Solution: log-linear (max-entropy) modeling: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * ... Throw them all in! Features may interact in arbitrary ways. Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.

How about structured outputs?
- Log-linear models are great for n-way classification.
- They are also good for predicting sequences (e.g., tagging "find preferred tags" as v a n), but to allow fast dynamic programming we only use n-gram features.
- They are also good for dependency parsing ("... find preferred links ..."), but to allow fast dynamic programming or MST parsing we only use single-edge features.

Edge-Factored Parsers (McDonald et al. 2005)
- Running example (Czech): "Byl jasný studený dubnový den a hodiny odbíjely třináctou" ("It was a bright cold day in April and the clocks were striking thirteen").
- Is this a good edge, the link between the adjective "jasný" and the noun "den"? Yes: lots of green, i.e., many features fire in its favor, starting with the word pair "jasný den" ("bright day").
- More features for the same edge: the tag-annotated pair "jasný N" ("bright NOUN"), which backs off from the particular noun. (The tags for the sentence are Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C.)
- The bare tag pair A N.
- The tag pair A N in context, here immediately preceding a conjunction ("a").

How about this competing edge, the one linking "jasný" to "hodiny"? Not as good: lots of red, i.e., negative evidence.
- The word pair "jasný hodiny" ("bright clocks") is undertrained: too rare in the training data to be informative.
- Backing off to stems gives "jasn hodi" ("bright clock", stems only). (The stems are byl jasn stud dubn den a hodi odbí třin.)
- The tag pair with number marking fires: the adjective and the noun disagree in number.
- So does the tag pair A N where the N follows a conjunction.

Which edge is better, "bright day" or "bright clocks"?
- Score of an edge e = θ · features(e), where θ is our current weight vector.
- Standard algorithms then find the valid parse with the maximum total score.
- Validity imposes hard constraints: two candidate edges can't both be chosen if they would give a word two parents (one parent per word), a set of edges can't all be chosen if they would form a cycle, and (if we require projectivity) two edges can't both be chosen if they cross.
- Thus, an edge may lose (or win) because of a consensus of other edges.
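To make the edge-factored scoring concrete, here is a minimal sketch (not the authors' code) of a log-linear edge scorer in the spirit of the slides above. The feature templates, stems, and weight values are illustrative assumptions, chosen only to mirror the "bright day" vs. "bright clocks" contrast.

```python
# A minimal sketch of edge-factored scoring.  Feature names and weights are
# hypothetical, chosen to echo the "jasny den" vs. "jasny hodiny" example.
from collections import defaultdict

def edge_features(sent, head, child):
    """Features of a single candidate link head -> child.

    sent is a list of (word, stem, tag) triples; indices are 0-based.
    Each feature fires with value 1.0 (a real parser would add many more).
    """
    hw, hs, ht = sent[head]
    cw, cs, ct = sent[child]
    return {
        f"word-pair:{hw}~{cw}": 1.0,      # e.g. "den~jasny"
        f"stem-pair:{hs}~{cs}": 1.0,      # backs off to stems for rare words
        f"tag-pair:{ht}~{ct}": 1.0,       # e.g. "N~A"
        f"tag-pair+dir:{ht}~{ct}:{'R' if child > head else 'L'}": 1.0,
    }

def score_edge(weights, sent, head, child):
    """Log-linear score: dot product of the weight vector with the features."""
    return sum(weights[f] * v for f, v in edge_features(sent, head, child).items())

# Toy weight vector (illustrative values only).
weights = defaultdict(float)
weights["tag-pair:N~A"] = 2.0            # adjectives like to modify nouns
weights["stem-pair:den~jasn"] = 1.5      # "bright day" is well attested
weights["stem-pair:hodi~jasn"] = -0.5    # "bright clocks" is rarely seen

sent = [("jasný", "jasn", "A"), ("den", "den", "N"), ("hodiny", "hodi", "N")]
print(score_edge(weights, sent, head=1, child=0))  # den -> jasný: higher score
print(score_edge(weights, sent, head=2, child=0))  # hodiny -> jasný: lower score
```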
Outline (section divider; next: finding the best parse)

Finding Highest-Scoring Parse
- Convert to a context-free grammar (CFG), then use dynamic programming.
- Example: "The cat in the hat wore a stovepipe." Vertically stretching the graph drawing shows the tree hanging from ROOT: "wore" at the top, with "cat" (carrying "The" and "in the hat") and "stovepipe" (carrying "a") below it. Each subtree is a linguistic constituent (here, a noun phrase).
- The CKY algorithm for CFG parsing is O(n³). Unfortunately, it is O(n⁵) in this case: to score the link between "wore" and "cat", it is not enough to know that the subtree is an NP; we must know it is rooted at "cat". So the nonterminal set grows by a factor of O(n), to {NP_the, NP_cat, NP_hat, ...}, and CKY's "grammar constant" is no longer constant.
- Solution: use a different decomposition (Eisner 1996), which brings us back to O(n³).

Spans vs. constituents: two kinds of substring
- A constituent of the tree links to the rest of the sentence only through its headword (root).
- A span of the tree links to the rest only through its endwords.
- (Both are illustrated on "The cat in the hat wore a stovepipe.")

Decomposing a tree into spans
- "The cat in the hat wore a stovepipe." = "The cat" + "cat in the hat wore a stovepipe."
- "cat in the hat wore a stovepipe." = "cat in the hat wore" + "wore a stovepipe."
- "cat in the hat wore" = "cat in" + "in the hat wore"
- "in the hat wore" = "in the hat" + "hat wore"

Finding Highest-Scoring Parse (continued)
- With the span decomposition (Eisner 1996) we are back to O(n³), and we can play the usual dynamic-programming tricks: further refining the constituents or spans, letting the probability model keep track of even more internal information, A*, best-first, and coarse-to-fine search.
- Training by EM etc. requires "outside" probabilities of constituents, spans, or links.

Hard Constraints on Valid Trees
- Score of an edge e = θ · features(e); standard algorithms find the valid parse with the maximum total score.
- Valid means one parent per word, no cycles, and (optionally) no crossing links. Thus, an edge may lose (or win) because of a consensus of other edges.

Non-Projective Parses
- "ROOT I 'll give a talk tomorrow on bootstrapping": the subtree rooted at "talk" is a discontiguous noun phrase, so its links cross. The "projectivity" restriction (no crossing links) forbids this. Do we really want it?
- Non-projectivity is occasional in English but frequent in Latin and other languages, e.g. "ista meam norit gloria canitiem" (that-NOM my-ACC may-know glory-NOM going-gray-ACC), "That glory may know my going-gray" (i.e., it shall last till I go gray).

Finding the highest-scoring non-projective tree
- Consider the sentence "John saw Mary" (left). The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective, in time O(n²).
[Figure, slide thanks to Dragomir Radev: the complete weighted graph over root, "John", "saw", and "Mary" on the left, and the maximum-weight spanning tree chosen from it on the right.]
- Chu-Liu-Edmonds in brief: every node selects its best parent; if cycles form, contract them and repeat.

Summing over all non-projective trees
- Finding the single highest-scoring non-projective tree takes O(n²). But how about the total weight Z of all trees? How about outside probabilities or gradients? These can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).

Graph Theory to the Rescue! O(n³) time!
- Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, equals the sum of scores of all directed spanning trees of G rooted at node r. Exactly the Z we need!

Building the Kirchhoff (Laplacian) matrix
- Negate the edge scores: entry (i, j) is -s(i, j) for i ≠ j.
- Sum each column over that child's possible parents to get the diagonal: entry (j, j) is Σ_{i≠j} s(i, j).
- Strike out the root row and column, and take the determinant.
- N.B.: This construction allows multiple children of the root; but see Koo et al. 2007.

Why should this work?
- It is clear for a 1×1 matrix; use induction for larger ones.
- The induction mirrors Chu-Liu-Edmonds (every node selects its best parent; if cycles, contract and recur): roughly, expanding the determinant, the terms involving s(1,2) amount to s(1,2) times the determinant of K with the edge between 1 and 2 contracted (compare the minor K({1,2}|{1,2})).
- (The sketch is for the undirected case; the directed case needs special handling of the root.)
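As a concrete companion to the Matrix-Tree slides, here is a small sketch (assuming exponentiated edge scores s[h, m] for head h and modifier m, with node 0 as the root) that builds the Kirchhoff/Laplacian matrix exactly as described above and checks its determinant against brute-force enumeration of spanning trees. The random scores and the helper names are my own; this is an illustration, not the paper's implementation.

```python
# Matrix-Tree computation of Z (the total weight of all non-projective
# dependency trees rooted at node 0), checked by brute-force enumeration.
import itertools
import numpy as np

def kirchhoff_Z(s):
    """det of the Laplacian with the root row/column struck out = sum over
    all trees rooted at 0 of the product of their edge weights."""
    n = s.shape[0] - 1                       # number of words
    L = np.zeros((n, n))
    for m in range(1, n + 1):                # modifier (column of s)
        L[m - 1, m - 1] = sum(s[h, m] for h in range(n + 1) if h != m)
        for h in range(1, n + 1):
            if h != m:
                L[h - 1, m - 1] = -s[h, m]   # negated edge score
    return np.linalg.det(L)

def brute_force_Z(s):
    """Enumerate every parent assignment and keep those that form a tree."""
    n = s.shape[0] - 1
    total = 0.0
    for parents in itertools.product(range(n + 1), repeat=n):
        if any(parents[m - 1] == m for m in range(1, n + 1)):
            continue                          # no self-loops
        ok = True
        for m in range(1, n + 1):             # every word must reach the root
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:               # cycle
                    ok = False
                    break
                seen.add(cur)
                cur = parents[cur - 1]
            if not ok:
                break
        if ok:
            w = 1.0
            for m in range(1, n + 1):
                w *= s[parents[m - 1], m]
            total += w
    return total

rng = np.random.default_rng(0)
s = rng.uniform(0.5, 2.0, size=(4, 4))        # root + 3 words; s[:, 0] unused
print(kirchhoff_Z(s), brute_force_Z(s))       # the two values agree
```

Note that, as the slide says, this construction lets the root take multiple children; restricting to a single root child needs the refinement of Koo et al. 2007.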
Outline (section divider; next: higher-order parsing, throwing in more features)

Exactly Finding the Best Parse
- "... find preferred links ...": to allow fast dynamic programming or MST parsing, we used only single-edge features. With arbitrary features, the runtime blows up.
- Projective parsing is O(n³) by dynamic programming; adding grandparent features makes it O(n⁴); grandparent + sibling bigrams, O(n⁵); POS trigrams, O(n³g⁶); and non-adjacent sibling pairs, O(2ⁿ).
- Non-projective parsing is O(n²) by minimum spanning tree, but with any of the above features, soft penalties for crossing links, or pretty much anything else, it becomes NP-hard.

Let's reclaim our freedom (again!): this paper in a nutshell
- Output probability is a product of local factors, a log-linear model: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * ... Throw in any factors we want! Certain global factors are OK too.
- How could we find the best parse? Integer linear programming (Riedel et al., 2006) doesn't give us probabilities when training or parsing. MCMC may be slow to mix, with a high rejection rate because of the hard TREE constraint. Greedy hill-climbing (McDonald & Pereira 2006). None of these exploit the tree structure of parses as the first-order methods do.
- Instead, let the local factors negotiate via "belief propagation": links (and tags) reinforce or suppress one another. Each iteration takes total time O(n²) or O(n³); each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside). This converges to a pretty good (but approximate) global parse.
- The analogy between training and decoding with many features:
  - Training with many features: iterative scaling; each weight in turn is influenced by the others; iterate to achieve globally optimal weights; to train a distribution over trees, use dynamic programming to compute the normalizer Z.
  - Decoding with many features (new!): belief propagation; each variable in turn is influenced by the others; iterate to achieve locally consistent beliefs; to decode a distribution over trees, use dynamic programming to compute the messages.

Outline (section divider; next: throwing in more features with graphical models)

Local factors in a graphical model: first, a familiar example, a Conditional Random Field (CRF) for POS tagging
- The observed input sentence "find preferred tags" is shaded; the remaining variables are its tags. One possible tagging (an assignment to those variables) is v v v; another is v a n.
- A "binary" factor between each pair of adjacent tags measures their compatibility, and the model reuses the same parameters at every position. Read off the slide (rows are the left tag, columns the right tag), the table is:

        v   n   a
    v   0   2   1
    n   2   1   0
    a   0   3   1

- A "unary" factor evaluates each position's tag. Its values depend on the corresponding word (and could be made to depend on the entire observed sentence), so there is a different unary factor at each position: "find" scores v 0.3, n 0.02, a 0; "preferred" scores v 0.3, n 0, a 0.1; "tags" scores v 0.2, n 0.2, a 0 (it can't be an adjective).
- p(v a n) is proportional to the product of all the factors' values on v a n: here 1 * 3 * 0.3 * 0.1 * 0.2.
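A tiny script to check the slide's arithmetic: multiplying the unary and binary factor values on the tagging v a n reproduces 1 * 3 * 0.3 * 0.1 * 0.2. The binary table is the reconstruction shown above; treat any cell not verified by that product as part of the reconstruction rather than ground truth.

```python
# Unnormalized score of a tagging = product of all factor values on it.
TAGS = ["v", "n", "a"]

binary = {                      # compatibility of two adjacent tags [left][right]
    "v": {"v": 0, "n": 2, "a": 1},
    "n": {"v": 2, "n": 1, "a": 0},
    "a": {"v": 0, "n": 3, "a": 1},
}
unary = {                       # how well each tag fits each observed word
    "find":      {"v": 0.3, "n": 0.02, "a": 0},
    "preferred": {"v": 0.3, "n": 0,    "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0},
}

def unnormalized(words, tags):
    score = 1.0
    for w, t in zip(words, tags):
        score *= unary[w][t]
    for t1, t2 in zip(tags, tags[1:]):
        score *= binary[t1][t2]
    return score

words = ["find", "preferred", "tags"]
print(unnormalized(words, ["v", "a", "n"]))   # 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018
# p(v a n) is that value divided by Z, the sum over all 3^3 taggings:
Z = sum(unnormalized(words, [a, b, c]) for a in TAGS for b in TAGS for c in TAGS)
print(unnormalized(words, ["v", "a", "n"]) / Z)
```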
Now let's do dependency parsing!
- Use O(n²) boolean variables, one for each possible link among the words of "... find preferred links ..." (plus the root).
- A possible parse is encoded as an assignment of true/false to these variables; another assignment encodes another parse.
- Some assignments are illegal: one contains a cycle; another gives a word multiple parents.

Local and global factors for parsing: so what factors shall we multiply to define the parse probability?
- Unary factors evaluate each link in isolation. As before, the goodness of a link can depend on the entire observed input context: given this sentence, one link's factor might be t 2 / f 1, while some other links aren't as good (say t 1 / f 8).
- But what if the best assignment under these unary factors isn't a tree?
- Add a global TREE factor over all the link variables to require that the links form a legal tree. This is a "hard constraint": the factor is 1 on assignments that are legal trees and 0 on all others. Over 6 link variables its table has 64 entries: ffffff 0, ffffft 0, fffftf 0, ..., fftfft 1 (we're legal!), ..., tttttt 0.
- Optionally require the tree to be projective (no crossing links).
- So far, this is equivalent to edge-factored parsing (McDonald et al. 2005). Note that McDonald et al. don't loop through this exponentially large table to consider the trees one at a time; they use combinatorial algorithms, and so should we!
- Second-order effects are factors on two link variables: a grandparent factor (value 3 when both links are present, 1 otherwise, so the pair is encouraged) and a no-cross factor (value 0.2 when both of two crossing links are present, 1 otherwise, so crossing is penalized).
- More factors of the same flavor: siblings, hidden POS tags, subcategorization, ...
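Here is a minimal sketch of the global TREE factor as a 0/1 hard constraint over the boolean link variables. It simply checks the tree conditions (one parent per word, no cycles, rooted at node 0); the example links are purely illustrative, and the parser itself never inspects assignments one at a time like this.

```python
# The global TREE factor as a hard constraint: 1 on legal trees, 0 otherwise.
# links maps (head, child) pairs to True/False; node 0 is the root.
def tree_factor(n_words, links):
    parents = {}
    for (h, c), on in links.items():
        if not on:
            continue
        if c in parents:                     # a word with two parents
            return 0
        parents[c] = h
    if set(parents) != set(range(1, n_words + 1)):
        return 0                             # some word has no parent
    for c in range(1, n_words + 1):          # every word must reach the root
        seen, cur = set(), c
        while cur != 0:
            if cur in seen:                  # cycle
                return 0
            seen.add(cur)
            cur = parents[cur]
    return 1

# "find(1) preferred(2) links(3)" with root 0:
good = {(0, 1): True, (1, 3): True, (3, 2): True,
        (2, 1): False, (1, 2): False, (0, 3): False}
bad = {**good, (2, 1): True}                 # gives word 1 two parents
print(tree_factor(3, good), tree_factor(3, bad))   # prints: 1 0
```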
Outline (section divider; next: finding the best parse with belief propagation)

Good to have lots of features, but ...
- Nice model; shame about the NP-hardness. Can we approximate?
- Machine learning to the rescue! The ML community has given a lot to NLP; in the 2000s, NLP has been giving back to ML, mainly techniques for joint prediction of structures. (Much earlier, speech recognition had HMMs, EM, smoothing ...)

Great Ideas in ML: Message Passing (adapted from the MacKay 2003 textbook)
- Count the soldiers in a line: each soldier passes "1 before you", "2 before you", ..., "5 before you" down the line and "1 behind you", ..., "5 behind you" back up it, each adding "there's 1 of me".
- A soldier's belief comes only from its incoming messages: must be 2 + 1 + 3 = 6 of us (or, one position over, 1 + 1 + 4 = 6 of us).
- The same idea works on a tree of soldiers: each soldier receives reports from all branches, e.g. 7 here + 3 here + 1 of me = 11 here, or 3 + 3 + 1 = 7 here, and a node's belief is the total, e.g. "must be 14 of us".

Great ideas in ML: Forward-Backward
- In the CRF, message passing = forward-backward. Each position's belief combines the α message arriving from the left, the β message arriving from the right, and the local unary factor; the binary factor tables govern how a message is transformed as it passes from one position to the next. (The slide shows example α, β, and belief vectors over v, n, a at each position.)
- Extend the CRF to a "skip chain" to capture a non-local factor (say, directly connecting "find" and "tags"): the belief now has more influences.
- But then the skip-chain messages are not independent, because the graph becomes loopy. Solution: pretend they are independent anyway!

Two great tastes that taste great together: upcoming attractions. "You got belief propagation in my dynamic programming!" "You got dynamic programming in my belief propagation!"
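A short sketch of forward-backward viewed as message passing on the chain CRF from the earlier slides. The α and β vectors are computed from the reconstructed factor tables above; they are not the specific message values pictured on the slides.

```python
# Forward-backward as sum-product message passing on a 3-word chain CRF.
import numpy as np

TAGS = ["v", "n", "a"]
unary = np.array([[0.3, 0.02, 0.0],    # find
                  [0.3, 0.0,  0.1],    # preferred
                  [0.2, 0.2,  0.0]])   # tags
binary = np.array([[0, 2, 1],
                   [2, 1, 0],
                   [0, 3, 1]], dtype=float)   # binary[left_tag, right_tag]

n = unary.shape[0]
alpha = np.zeros((n, 3))               # message arriving from the left
beta = np.zeros((n, 3))                # message arriving from the right
alpha[0] = 1.0
beta[-1] = 1.0
for i in range(1, n):                  # forward pass
    alpha[i] = (alpha[i - 1] * unary[i - 1]) @ binary
for i in range(n - 2, -1, -1):         # backward pass
    beta[i] = binary @ (beta[i + 1] * unary[i + 1])

belief = alpha * unary * beta          # unnormalized tag marginals per position
belief /= belief.sum(axis=1, keepdims=True)
for i, word in enumerate(["find", "preferred", "tags"]):
    print(word, dict(zip(TAGS, belief[i].round(3))))
```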
Loopy Belief Propagation for Parsing
- A cartoon of the messages: the sentence tells word 3, "Please be a verb." Word 3 tells the 3→7 link, "Sorry, then you probably don't exist." The 3→7 link tells the TREE factor, "You'll have to find another parent for 7." The TREE factor tells the 10→7 link, "You're on!" The 10→7 link tells word 10, "Could you please be a noun?" And so on.
- Higher-order factors (e.g., Grandparent) induce loops. Watching a loop around one triangle: strong links suppress or promote other links.
- How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?" But that is exactly the outside probability of the green link! The TREE factor computes all of its outgoing messages at once, given all of its incoming messages: in the projective case in total O(n³) time by inside-outside, and in the non-projective case in total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007). (A small sketch of the non-projective computation appears after the runtime table below.)
- Belief propagation assumes the incoming messages to TREE are independent, so the outgoing messages can be computed with first-order parsing algorithms (fast, with no grammar constant).

Some connections ...
- Parser stacking (Nivre & McDonald 2008; Martins et al. 2008).
- Global constraints in arc consistency, e.g. the ALLDIFFERENT constraint (Régin 1994).
- A matching constraint in max-product BP for computer vision (Duchi et al., 2006); it could be used for machine translation.
- As far as we know, our parser is the first use of global constraints in sum-product BP.

Outline (section divider; next: experiments)

Runtimes for each factor type (see paper):

    Factor type        degree   runtime   count    total
    Tree               O(n²)    O(n³)     1        O(n³)
    Proj. Tree         O(n²)    O(n³)     1        O(n³)
    Individual links   1        O(1)      O(n²)    O(n²)
    Grandparent        2        O(1)      O(n³)    O(n³)
    Sibling pairs      2        O(1)      O(n³)    O(n³)
    Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
    NoCross            O(n)     O(n)      O(n²)    O(n³)
    Tag                1        O(g)      O(n)     O(n)
    TagLink            3        O(g²)     O(n²)    O(n²)
    TagTrigram         O(n)     O(ng³)    1        O(ng³)
    TOTAL                                          O(n³) per iteration

  The totals are additive, not multiplicative!
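The following sketch shows how, in the non-projective case, the TREE factor can emit all of its outgoing messages at once: the marginal ("outside") probability of every link falls out of the inverse of the same Kirchhoff matrix used for Z (Smith & Smith 2007; Koo et al. 2007). The indexing convention and the random scores are my assumptions; treat it as an illustration rather than the paper's code.

```python
# Edge marginals under the distribution over all non-projective trees.
# s[h, m] plays the role of the product of incoming messages saying how much
# link h -> m wants to be present; node 0 is the root.
import numpy as np

def edge_marginals(s):
    n = s.shape[0] - 1
    L = np.zeros((n, n))                       # Kirchhoff/Laplacian, root struck out
    for m in range(1, n + 1):
        L[m - 1, m - 1] = sum(s[h, m] for h in range(n + 1) if h != m)
        for h in range(1, n + 1):
            if h != m:
                L[h - 1, m - 1] = -s[h, m]
    Linv = np.linalg.inv(L)
    mu = np.zeros_like(s)                      # mu[h, m] = P(link h -> m present)
    for m in range(1, n + 1):
        mu[0, m] = s[0, m] * Linv[m - 1, m - 1]
        for h in range(1, n + 1):
            if h != m:
                mu[h, m] = s[h, m] * (Linv[m - 1, m - 1] - Linv[m - 1, h - 1])
    return mu

rng = np.random.default_rng(1)
s = rng.uniform(0.5, 2.0, size=(4, 4))
mu = edge_marginals(s)
print(mu.round(3))
print(mu[:, 1:].sum(axis=0))   # each word has exactly one parent: columns sum to 1.0
```

For tiny sentences the same marginals can be checked by brute-force enumeration of trees, as in the earlier Z snippet.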
Runtimes for each factor type, continued (same table as above)
- Each "global" factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables; see the paper for efficient propagators.

Experimental Details
- Decoding: run several iterations of belief propagation, get the final beliefs at the link variables, and feed them into a first-order parser. This gives the minimum-Bayes-risk tree (it minimizes expected error). (A toy sketch of this decode step appears after the final slide.)
- Training: BP computes beliefs about each factor too, which gives us gradients for maximum conditional likelihood (as in the forward-backward algorithm).
- Features used in the experiments: first-order, individual links just as in McDonald et al. 2005; higher-order, Grandparent, Sibling bigrams (ChildSeq), and NoCross.

Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

                     Danish   Dutch   English
    Tree+Link          85.5    87.3     88.6
    +NoCross           86.1    88.3     89.1
    +Grandparent       86.1    88.6     89.4
    +ChildSeq          86.5    88.5     90.1

    Best projective parse with all factors (exact, slow)
                       86.0    84.5     90.2
    +hill-climbing     86.1    87.6     90.2

- The exact projective alternative (and hill-climbing from it) doesn't fix enough edges.

Time vs. Projective Search Error
- [Figure: runtime vs. projective search error for BP at increasing numbers of iterations (up to 140), compared with the exact O(n⁴) DP and the exact O(n⁵) DP.]

Outline (section divider; next: conclusions)

Freedom Regained: this paper in a nutshell
- Output probability is defined as a product of local and global factors; throw in any factors we want (a log-linear model). Each factor must be fast, but they run independently.
- Let the local factors negotiate via "belief propagation": each bit of syntactic structure is influenced by the others. Some factors need combinatorial algorithms to compute their messages fast, e.g. existing parsing algorithms using dynamic programming (compare reranking or stacking).
- Each iteration takes total time O(n³) or even O(n²); see the paper. The method converges to a pretty good (but approximate) global parse.
- The upshot: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.

Future Opportunities
- Efficiently modeling more hidden structure.
- Beyond dependencies: constituency parsing, traces, lattice parsing.
- Beyond parsing: POS tags, link roles, secondary links (DAG-shaped parses); alignment and translation; bipartite matching and network flow; joint decoding of parsing and other tasks (IE, MT, reasoning, ...).
- Beyond text: image tracking and retrieval; social networks.

Thank you.
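Finally, a toy version of the decode step from the Experimental Details slide: take the beliefs at the link variables and return the tree that maximizes the expected number of correct links (the minimum-Bayes-risk tree under attachment loss). The "first-order parser" here is a brute-force search, suitable only for tiny sentences and included purely for illustration; a real system would use Chu-Liu-Edmonds or Eisner's algorithm, and the belief values below are made up.

```python
# Minimum-Bayes-risk decoding from link beliefs (brute force, illustration only).
import itertools
import numpy as np

def mbr_tree(belief):
    """belief[h, m] = approximate P(link h -> m); node 0 is the root."""
    n = belief.shape[0] - 1
    best, best_score = None, float("-inf")
    for parents in itertools.product(range(n + 1), repeat=n):
        if any(parents[m - 1] == m for m in range(1, n + 1)):
            continue                           # no self-loops
        ok = True
        for m in range(1, n + 1):              # every word must reach the root
            seen, cur = set(), m
            while cur != 0 and ok:
                ok = cur not in seen
                seen.add(cur)
                cur = parents[cur - 1]
        if not ok:
            continue
        score = sum(belief[parents[m - 1], m] for m in range(1, n + 1))
        if score > best_score:                 # expected number of correct links
            best, best_score = parents, score
    return best                                # best[m-1] is the head of word m

beliefs = np.array([[0, 0.7, 0.1, 0.2],        # made-up beliefs, root + 3 words
                    [0, 0.0, 0.6, 0.3],
                    [0, 0.2, 0.0, 0.5],
                    [0, 0.1, 0.3, 0.0]])
print(mbr_tree(beliefs))                       # (0, 1, 2): heads 0->1, 1->2, 2->3
```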