Approximation-aware Dependency Parsing by Belief Propagation
Matt Gormley, Mark Dredze, Jason Eisner
September 19, 2015, TACL at EMNLP

Motivation #1: Approximation-unaware Learning
Problem: Approximate inference causes standard learning algorithms to go awry (Kulesza & Pereira, 2008).
[Figure: behavior of learning with exact inference vs. with approximate inference]
Can we take our approximations into account?

Motivation #2: Hybrid Models
Graphical models let you encode domain knowledge. Neural nets are really good at fitting the data discriminatively to make good predictions. Could we define a neural net that incorporates domain knowledge?

Our Solution
Key idea: Treat your unrolled approximate inference algorithm as a deep network.
[Figure: the unrolled inference network; one of its layers is a chart parser]

Talk Summary
Loopy BP + Dynamic Prog. = Structured BP
Loopy BP + Backprop. = ERMA / Back-BP
Loopy BP + Dynamic Prog. + Backprop. = This Talk

Loopy BP + Dynamic Prog. + Backprop. = This Talk
Graphical Models + Hypergraphs + Neural Networks = the models that interest me
• If you're thinking, "This sounds like a great direction!"
• Then you're in good company
• And have been since before 1995
• So what's new since 1995? Two new emphases:
  1. Learning under approximate inference
  2. Structural constraints

An Abstraction for Modeling
Factor Graph (bipartite graph):
• variables (circles)
• factors (squares)
[Figure: a small factor graph with variables y1, y2 and factors ψ2, ψ12; each factor is a table of nonnegative values over True/False assignments]

Factor Graph for Dependency Parsing
For the example sentence "0 <WALL> 1 Juan_Carlos 2 abdica 3 su 4 reino", the model has one boolean variable Y_{h,m} for every possible directed arc (left arc or right arc) from head h to modifier m: Y_{0,1}, Y_{0,2}, ..., Y_{4,3}. An assignment to these variables marks each arc as present (✔) or absent.

The factors:
• Unary: local opinion about one edge
• PTree: hard constraint, multiplying in 1 if the variables form a tree and 0 otherwise
• Grandparent: local opinion about a grandparent, head, and modifier
• Sibling: local opinion about a pair of arbitrary siblings

Now we can work at this level of abstraction.
Why dependency parsing?
1. Simplest example for Structured BP
2. Exhibits both polytime and NP-hard problems
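To make the factor-graph encoding above concrete, here is a minimal Python sketch of the boolean arc variables and the PTree hard constraint for the running sentence. It is illustrative only, not the paper's implementation: the variable layout, function names, and the example gold parse are assumptions, and the factor is evaluated by brute force rather than with the efficient dynamic program the talk describes.

```python
import math

words = ["<WALL>", "Juan_Carlos", "abdica", "su", "reino"]
N = len(words) - 1                     # tokens 1..N; 0 is the wall

# One boolean variable Y[h,m] per possible directed arc h -> m.
ARCS = [(h, m) for h in range(N + 1) for m in range(1, N + 1) if h != m]

def ptree_potential(assignment):
    """PTree factor: multiplies in 1.0 if the 'on' arcs form a projective
    dependency tree rooted at the wall (token 0), and 0.0 otherwise."""
    heads = {}
    for m in range(1, N + 1):
        hs = [h for (h, mm) in ARCS if mm == m and assignment[(h, mm)]]
        if len(hs) != 1:               # every token needs exactly one head
            return 0.0
        heads[m] = hs[0]

    def ancestors(k):                  # ancestors of k; bounded so a cycle
        chain = []                     # cannot loop forever
        while k != 0 and len(chain) <= N:
            k = heads[k]
            chain.append(k)
        return chain

    for m in range(1, N + 1):          # acyclic and rooted: every token
        if 0 not in ancestors(m):      # must reach the wall
            return 0.0
    for m, h in heads.items():         # projective: every word strictly
        lo, hi = min(h, m), max(h, m)  # between h and m is a descendant of h
        for k in range(lo + 1, hi):
            if h not in ancestors(k):
                return 0.0
    return 1.0

def unary_potential(h, m, is_on, score):
    """Unary factor: a local (log-linear) opinion about a single arc."""
    return math.exp(score) if is_on else 1.0

# Hypothetical gold parse of the example sentence:
#   <WALL> -> abdica, abdica -> Juan_Carlos, abdica -> reino, reino -> su
gold = {arc: 0 for arc in ARCS}
for arc in [(0, 2), (2, 1), (2, 4), (4, 3)]:
    gold[arc] = 1
assert ptree_potential(gold) == 1.0    # a legal projective tree
```

In the real parser nothing is enumerated this way; the point of Structured BP is that the PTree factor's effect on inference is computed with an inside-outside style dynamic program, as the following slides describe.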
The Impact of Approximations
Linguistics informs the Model, which defines a probability distribution, e.g. pθ(·) = 0.50 for an example parse. Inference computes with the model, and Learning fits its parameters. (Inference is usually called as a subroutine in learning.)

Machine Learning: Conditional Log-likelihood Training
1. Choose the model, such that the derivative in #3 is easy to compute (here it comes from log-linear factors)
2. Choose the objective: assign high probability to the things we observe and low probability to everything else
3. Compute the derivative by hand using the chain rule
4. Replace exact inference by approximate inference

Machine Learning: What's wrong with CLL?
How did we compute these approximate marginal probabilities anyway? By Structured Belief Propagation, of course!

Everything you need to know about Structured BP:
1. It's a message passing algorithm
2. The message computations are just multiplication, addition, and division
3. Those computations are differentiable

Inference: Structured Belief Propagation
This is just another factor graph, so we can run Loopy BP.
What goes wrong?
• Naïve computation is inefficient
• We can embed the inside-outside algorithm within the structured factor

Algorithmic Differentiation
• Backprop works on more than just neural networks
• You can apply the chain rule to any arbitrary differentiable algorithm. That's the key (old) idea behind this talk.
• Alternatively, one could estimate a gradient by finite-difference approximations, but algorithmic differentiation is much more efficient!
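To back up the claim that BP's message computations use only multiplication, addition, and division (and are therefore differentiable), here is a minimal NumPy sketch of one round of sum-product updates on a toy two-variable factor graph. The potential values are arbitrary and this is not the paper's code; in the real parser, the expensive messages out of the PTree factor are computed with a variant of inside-outside rather than explicitly.

```python
import numpy as np

# Toy factor graph: binary variables y1, y2, a unary factor on each,
# and one pairwise factor between them.  Values are arbitrary.
psi1  = np.array([0.1, 5.2])           # unary on y1, indexed [False, True]
psi2  = np.array([4.0, 2.0])           # unary on y2
psi12 = np.array([[2.0, 9.0],          # pairwise, indexed [y1, y2]
                  [4.0, 2.0]])

def normalize(m):
    return m / m.sum()                 # division; keeps messages stable

# One round of sum-product message passing (only *, +, / appear):
m_y1_to_f = normalize(psi1)            # variable -> factor: product of the
m_y2_to_f = normalize(psi2)            #   other incoming messages
m_f_to_y1 = normalize(psi12 @ m_y2_to_f)    # factor -> variable: multiply,
m_f_to_y2 = normalize(psi12.T @ m_y1_to_f)  #   then sum out the other variable

# Beliefs (approximate marginals): product of incoming messages, normalized.
b_y1 = normalize(psi1 * m_f_to_y1)
b_y2 = normalize(psi2 * m_f_to_y2)
print(b_y1, b_y2)
```

On this tiny tree-shaped graph the beliefs are already exact; on a loopy graph one simply iterates the same updates, which is exactly the computation the next slides unroll into a feed-forward network.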
Feed-forward Topology of Inference, Decoding and Loss
• Unary factor: a vector with 2 entries; binary factor: a (flattened) matrix with 4 entries
• Messages from neighbors are used to compute the next message, which leads to sparsity in the layer-wise connections
• Reading the network bottom-up: model parameters, factors, messages at time t = 0, messages at time t = 1, ..., messages at time t = 3, beliefs, decode / loss
• Arrows in this diagram carry a different semantics, given by the algorithm; arrows in a neural net mean a linear combination followed by a sigmoid
• Messages from the PTree factor rely on a variant of inside-outside, i.e. a chart parser embedded in the network

Machine Learning: Approximation-aware Learning
1. Choose the model to be the computation with all its approximations
2. Choose the objective to likewise include the approximations
3. Compute the derivative by backpropagation (treating the entire computation as if it were a neural network)
4. Make no approximations! (Our gradient is exact.)
Key idea: Open up the black box! A toy sketch of this recipe follows.
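Here is a toy illustration of the recipe above, not the paper's system (which is implemented in Java with hand-written backpropagation through structured BP): unroll a fixed number of loopy BP iterations on a small cyclic model, attach an L2 loss on the resulting beliefs, and let automatic differentiation, JAX in this sketch, deliver the exact gradient of the approximate computation. The parameter shapes, names, and the pretend gold marginals are all assumptions made for the example.

```python
import jax.numpy as jnp
from jax import grad

T = 5   # number of unrolled BP iterations; plays the role of network depth

def unrolled_bp_loss(theta, gold_marginals):
    """Loopy BP on a 3-variable cycle (y1 - y2 - y3 - y1), unrolled for T
    iterations, followed by an L2 loss on the resulting beliefs.  Every step
    is just *, +, /, so jax.grad returns the exact gradient of this
    approximate inference-plus-loss computation w.r.t. the parameters."""
    # Pairwise log-potentials psi12[y1,y2], psi23[y2,y3], psi31[y3,y1].
    psi12, psi23, psi31 = [jnp.exp(t).reshape(2, 2) for t in jnp.split(theta, 3)]
    nrm = lambda m: m / m.sum()

    # Factor-to-variable messages, all initialized uniform (iteration t = 0).
    m12_1 = m12_2 = m23_2 = m23_3 = m31_3 = m31_1 = jnp.ones(2) / 2
    for _ in range(T):                     # one "layer" per BP iteration
        m12_1, m12_2, m23_2, m23_3, m31_3, m31_1 = (
            nrm(psi12 @ m23_2),            # psi12 -> y1 (sums out y2)
            nrm(psi12.T @ m31_1),          # psi12 -> y2 (sums out y1)
            nrm(psi23 @ m31_3),            # psi23 -> y2 (sums out y3)
            nrm(psi23.T @ m12_2),          # psi23 -> y3 (sums out y2)
            nrm(psi31 @ m12_1),            # psi31 -> y3 (sums out y1)
            nrm(psi31.T @ m23_3),          # psi31 -> y1 (sums out y3)
        )

    # Beliefs (approximate marginals): product of incoming messages.
    beliefs = jnp.concatenate(
        [nrm(m12_1 * m31_1), nrm(m12_2 * m23_2), nrm(m23_3 * m31_3)])
    return jnp.sum((beliefs - gold_marginals) ** 2)   # L2 loss on beliefs

theta = jnp.zeros(12)                             # 3 factors x 4 entries each
gold = jnp.array([0., 1., 1., 0., 0., 1.])        # pretend gold marginals
print(grad(unrolled_bp_loss)(theta, gold))        # exact gradient, via backprop
```

The unrolled iteration count T is fixed at training time, which is why a model trained this way can remain accurate with only a few BP iterations at test time.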
Experimental Setup
Goal: compare two training approaches
1. Standard approach (CLL)
2. New approach (Backprop)
Data: English PTB
– Converted to dependencies using Yamada & Matsumoto (2003) head rules
– Standard train (02-21), dev (22), test (23) split
– TurboTagger predicted POS tags
Metric: Unlabeled Attachment Score (UAS; higher is better)

Results: Speed-Accuracy Tradeoff
The new training approach yields models which are:
1. Faster for a given level of accuracy
2. More accurate for a given level of speed
[Figure: Speed-Accuracy Tradeoff, Dependency Parsing; UAS (88-93) vs. number of BP iterations (1-8) for CLL and Backprop training]

Results: Increasingly Cyclic Models
• As we add more factors to the model, the model becomes loopier
• Yet our training by Backprop consistently improves as models get richer
[Figure: Increasingly Cyclic Models, Dependency Parsing; UAS (90-93) for CLL and Backprop across increasingly rich models]

See our TACL paper for:
1) Results on 19 languages from CoNLL 2006 / 2007
2) Results with alternate training objectives
3) An empirical comparison of exact and approximate inference
[Figure 3 of the paper: speed/accuracy tradeoff of English PTB-YM UAS vs. the total number of BP iterations t_max for standard conditional likelihood training (CLL) and our approximation-aware training with either an L2 objective (L2) or a staged training of L2 followed by annealed risk (L2+AR); the x-axis shows the number of iterations used for both training and testing, and the model is 2nd-order with Grand.+Sib. factors]
[Figure 4 of the paper: English PTB-YM UAS vs. the types of 2nd-order factors included in the model, for approximation-aware training and standard conditional likelihood training; all models include 1st-order factors (Unary), and the 2nd-order models include grandparents (Grand.), arbitrary siblings (Sib.), or both (Grand.+Sib.), with 4 iterations of BP]
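For completeness, the evaluation metric used in these experiments is easy to state exactly: unlabeled attachment score is the fraction of tokens whose predicted head matches the gold head. A minimal sketch follows; the example head sequences are made up, using the parse assumed earlier for the running sentence as gold.

```python
def uas(predicted_heads, gold_heads):
    """Unlabeled Attachment Score: fraction of tokens whose predicted head
    index matches the gold head index (higher is better)."""
    assert len(predicted_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)

# Heads for tokens 1..4 of "Juan_Carlos abdica su reino" (0 = wall):
gold_heads = [2, 0, 4, 2]          # the parse assumed earlier
predicted  = [2, 0, 2, 2]          # a hypothetical prediction
print(uas(predicted, gold_heads))  # -> 0.75
```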
Machine Learning: Comparison of Two Approaches
1. CLL with approximate inference
   – A totally ridiculous thing to do!
   – But it's been done for years because it often works well
   – Also named "surrogate likelihood training" by Wainwright (2006)
2. Approximation-aware Learning for NLP (key idea: open up the black box!)
   – In hindsight, treating the approximations as part of the model is the obvious thing to do (Domke, 2010; Domke, 2011; Stoyanov et al., 2011; Ross et al., 2011; Stoyanov & Eisner, 2012; Hershey et al., 2014)
   – Our contribution: approximation-aware learning with structured factors
   – But there are some challenges to get it right (numerical stability, efficiency, backprop through structured factors, annealing a decoder's argmin)
   – Sum-Product Networks are similar in spirit (Poon & Domingos, 2011; Gens & Domingos, 2012)

Takeaways
• The new learning approach for Structured BP maintains high accuracy with fewer iterations of BP, even with cycles
• Need a neural network? Treat your unrolled approximate inference algorithm as a deep network

Questions?

Pacaya: an open source framework for hybrid graphical models, hypergraphs, and neural networks
Features:
– Structured BP
– Coming soon: approximation-aware training
Language: Java
URL: https://github.com/mgormley/pacaya