Penalized EP for Graphical Models Over Strings
Ryan Cotterell and Jason Eisner

Natural Language is Built from Words
Can store info about each word in a table:

Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS

Problem: Too Many Words!
• Technically speaking, # words = ∞
• Really, the set of (possible) words is Σ*
• Names
• Neologisms
• Typos
• Productive processes:
  – friend → friendless → friendlessness → friendlessnessless → …
  – hand + bag → handbag (sometimes can iterate)

Solution: Don't model every cell separately
[Figure: periodic table analogy, with the labels "Positive ions" and "Noble gases".]

Can store info about each word in a table (same table as above)
• Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text.
• Approach: Linguistics + generative modeling + statistical inference.
• Modeling ingredients: Finite-state machines + graphical models.
• Inference ingredients: Expectation Propagation (this talk).

Predicting Pronunciations of Novel Words (Morpho-Phonology)
[Figure: a graphical model relating morphemes to pronunciations. The suffixes -z and -eɪʃən attach to the stems dæmn and rizajgn, giving underlying forms dæmnz, dæmneɪʃən, rizajgnz, rizajgneɪʃən, which surface as dˌæmz (damns), dˌæmnˈeɪʃən (damnation), rizˈajnz (resigns), rˌɛzɪgnˈeɪʃən (resignation).]
• How do you pronounce this word (damns)? The model fills in the unobserved surface form: dˌæmz.

Graphical Models over Strings
• Use the graphical model framework to model many strings jointly!
[Figure: a factor graph whose variables X1, X2, … range over strings; each factor ψ is a weighted finite-state machine, and each marginal belief is a weighted set of strings (e.g. ring, rang, rung, aardvark, … with various weights).]

Zooming in on a WFSA
• Compactly represents an (unnormalized) probability distribution over all strings in Σ*
• Marginal belief: How do we pronounce damns?
• Possibilities: /damz/, /dams/, /damnIz/, etc.
[Figure: a small WFSA for these pronunciations, with arcs such as d/1, a/1, m/1, n/.25, z/.5, s/.25, I/1, z/1.]

Log-Linear Approximation
• Given a WFSA distribution p, find a log-linear approximation q
  – min KL(p || q)  ("inclusive" KL divergence)
  – q corresponds to a smaller/tidier WFSA
• Two approaches:
  – Gradient-based optimization (discussed here)
  – Closed-form optimization

ML Estimation = Moment Matching
• Broadcast n-gram counts from the data (e.g. bar = 2; a count table such as foo 1.2, bar 0.5, baz 4.3)
• Fit a model that predicts the same counts
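To make the moment-matching recipe above concrete, here is a minimal sketch (not part of the original deck). It assumes an explicit dictionary of expected bigram counts, the kind of numbers that forward-backward over a WFSA p would produce; the particular symbols and counts are invented for illustration. For a plain n-gram model, the maximum-likelihood fit that predicts these counts is obtained in closed form by normalizing them.

```python
from collections import defaultdict

# Expected bigram counts, e.g. as produced by forward-backward over a WFSA p.
# (These particular symbols and numbers are made up for illustration.)
expected_bigram_counts = {
    ("^", "r"): 1.0,    # "^" marks the start of the string
    ("r", "i"): 0.75,
    ("r", "a"): 0.25,
    ("i", "n"): 0.75,
    ("a", "n"): 0.25,
    ("n", "g"): 1.0,
    ("g", "$"): 1.0,    # "$" marks the end of the string
}

def fit_bigram_model(counts):
    """Moment matching for a bigram model: the maximum-likelihood estimate
    that predicts the given expected counts is just their normalization."""
    totals = defaultdict(float)
    for (prev, cur), c in counts.items():
        totals[prev] += c
    return {(prev, cur): c / totals[prev] for (prev, cur), c in counts.items()}

q = fit_bigram_model(expected_bigram_counts)
print(q[("r", "i")])   # 0.75: q predicts the same conditional moments
```

Computing the expected counts by forward-backward over the WFSA itself, rather than from a corpus, is exactly the step on the next slide.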
FSA Approx. = Moment Matching
[Figure: the WFSA p to be approximated.]
• Compute its expected n-gram counts with forward-backward! (e.g. bar = 2; a count table such as foo 1.2, bar 0.5, baz 4.3)
• Fit a model that predicts the same counts

Gradient-Based Minimization
• Objective: KL(p || qθ). The arc weights of q are determined by a parameter vector θ, just like a log-linear model.
• Gradient with respect to θ: the difference between two expectations of feature counts, E_q[f] − E_p[f], where the first expectation is computed from the weighted DFA q.
• Features are just n-gram counts!
(A toy, runnable version of this update appears in the code sketch after this part of the deck.)

Does q need a lot of features?
• Game: what order of n-grams do we need to put probability 1 on a string?
• Word 1: noon
  – Bigram model? No – need a trigram model
• Word 2: papa
  – Trigram model? No – need a 4-gram model – very big!
• Word 3: abracadabra
  – 6-gram model – way too big!

Variable Order Approximations
• Intuition: in NLP, marginals are often peaked
  – probability mass mostly on a few similar strings!
• q should reward a few long n-grams
  – but also needs short n-gram features for backoff
• 6-gram table (too big!): ^abrab 5.0, abraca 5.0, …, zzzzzz −500
• Variable-order table (very small!): ^a 5.0, abra 5.0, b 4.3
• Moral: use only the n-grams you really need!

Belief Propagation (BP) in a Nutshell
[Figure: a factor graph over string-valued variables X1, …, X7; the messages passed along its edges are WFSAs.]

Computing Marginal Beliefs
• The marginal belief at a variable combines (intersects) all of its incoming WFSA messages.
• Computation of the belief results in a large state space – what a hairball!
• Approximation required!

BP over String-Valued Variables
• In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex!
[Figure: a tiny cyclic factor graph – two string variables X1, X2 connected by factors ψ1, ψ2 over the alphabet {a} – in which the message WFSAs keep growing with every round of propagation.]
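Here is a minimal, runnable sketch of the gradient-based fit referenced above (it is not code from the paper). Purely for illustration, p is restricted to an explicit handful of strings so that both expectations can be computed by enumeration instead of by dynamic programming over WFSAs; the feature set (all 1- and 2-grams), learning rate, and iteration count are likewise assumptions.

```python
import math
from collections import Counter

# Toy stand-in for the WFSA distribution p: explicit probabilities
# over a few candidate strings (invented for illustration).
p = {"ring": 0.6, "rang": 0.3, "rung": 0.1}

def ngram_features(s, n_max=2):
    """Count all 1-grams and 2-grams of ^s$ (the n-gram features of q)."""
    s = "^" + s + "$"
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(s) - n + 1):
            feats[s[i:i + n]] += 1
    return feats

def expected_features(dist):
    """Expected n-gram counts under a distribution over strings."""
    e = Counter()
    for s, prob in dist.items():
        for f, c in ngram_features(s).items():
            e[f] += prob * c
    return e

def q_dist(theta, support):
    """Log-linear q over the (toy) support, with n-gram weights theta."""
    scores = {s: math.exp(sum(theta[f] * c for f, c in ngram_features(s).items()))
              for s in support}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

theta = Counter()                    # n-gram weights of q, initially all zero
target = expected_features(p)        # E_p[f]: the fixed target moments
for step in range(500):
    grad = expected_features(q_dist(theta, p))   # E_q[f]
    grad.subtract(target)                        # gradient of KL(p || q) wrt theta = E_q[f] - E_p[f]
    for f, g in grad.items():
        theta[f] -= 0.1 * g                      # gradient-descent step
print(q_dist(theta, p))              # approx. {'ring': 0.6, 'rang': 0.3, 'rung': 0.1}
```

The variable-order idea amounts to keeping only the n-gram weights that earn their keep; one simple stand-in, assumed here rather than taken from the paper, would be to add a sparsity penalty on θ and drop features whose weight stays at zero.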
Expectation Propagation (EP) in a Nutshell
[Figure: the same factor graph as before (X1, …, X7), but each incoming WFSA message is replaced, one at a time, by an approximate message: a small table of n-gram features and weights (foo 1.2, bar 0.5, baz 4.3).]

EP In a Nutshell
• The approximate belief is now a table of n-grams.
• The point-wise product is now super easy!
[Figure: the incoming n-gram tables (each foo 1.2, bar 0.5, baz 4.3) combine into the belief table foo 4.8, bar 2.0, baz 17.2 at X3.]
(See the code sketch at the end of this section for this combination step.)

How to approximate a message?
• Minimize, with respect to the parameters θ:
  KL( belief built from the exact WFSA message times the other approximate n-gram tables || belief built from the candidate n-gram table θ times the same other tables )
[Figure: the two point-wise products differ only in the one message being approximated: an exact WFSA on one side, an n-gram table θ (e.g. foo 0.2, bar 1.1, baz −0.3) on the other.]

Results
• Question 1: Does EP work in general (comparison to baseline)?
• Question 2: Do variable-order approximations improve over fixed n-grams?
• Unigram EP (green) – fast, but inaccurate
• Bigram EP (blue) – also fast and inaccurate
• Trigram EP (cyan) – slow and accurate
• Penalized EP (red) – fast and accurate
• Baseline (black, pruning-based) – accurate and slow

Fin
Thanks for your attention!
For more information on structured models and belief propagation, see the Structured Belief Propagation Tutorial at ACL 2015 by Matt Gormley and Jason Eisner.
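As a closing illustration of the combination step from the "EP In a Nutshell" slides (referenced there), here is a minimal sketch, not taken from the paper. Once every message is a log-linear table of n-gram weights, the point-wise product of messages is just the key-wise sum of those tables; the particular n-grams and weights below are invented.

```python
from collections import Counter

# Approximate messages into one variable: log-linear n-gram weight tables
# (like the "foo 1.2 / bar 0.5 / baz 4.3" tables in the slides).
# The n-grams and weights here are invented for illustration.
msg_a = Counter({"^d": 1.2, "am": 0.5, "mz$": 4.3})
msg_b = Counter({"^d": 0.3, "mn": -2.0, "mz$": 1.1})
msg_c = Counter({"am": 0.8})

def combine(*messages):
    """Point-wise product of log-linear messages = key-wise sum of weights."""
    belief = Counter()
    for m in messages:
        belief.update(m)          # Counter.update adds values for shared keys
    return belief

belief = combine(msg_a, msg_b, msg_c)
print(dict(belief))   # approx. {'^d': 1.5, 'am': 1.3, 'mz$': 5.4, 'mn': -2.0}
```

Dividing a message back out (which EP needs when it refits one factor's approximation) is equally cheap: subtract its weight table.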