Learning Approximate Inference Policies for Fast Prediction
Jason Eisner, ICML "Inferning" Workshop, June 2012

Beware: Bayesians in Roadway
A Bayesian is the person who writes down the function you wish you could optimize.

[Diagram: the many observed and latent variables of NLP, including semantics, lexicon (word types), entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, tokens, sentences, translation, alignment, editing, quotation, discourse context, resources, speech, misspellings/typos, formatting, entanglement, annotation.]
To recover variables, model and exploit their correlations.

Motivating Tasks
• Structured prediction (e.g., for NLP problems): parsing (trees), machine translation (word strings), word variants (letter strings, phylogenies, grids).
• Unsupervised learning via Bayesian generative models: given a few verb conjugation tables and a lot of text, find/organize/impute all verb conjugation tables of the language; given some facts and a lot of text, discover more facts through information extraction and reasoning.

Current Methods
• Dynamic programming: exact but slow.
• Approximate inference in graphical models: are the approximations any good? May use dynamic programming as a subroutine (structured BP).
• Sequential classification.

Speed-Accuracy Tradeoffs
Inference requires lots of computation.
• Is some computation going to waste? Sometimes the best prediction is overdetermined. Quick ad hoc methods sometimes work: how should we respond?
• Is some computation actively harmful? In approximate inference, passing a message can hurt, and it is frustrating to simplify the model just to fix this.
We want to keep improving our models, but we need good, fast approximate inference. So choose the approximations automatically, tuned to the data distribution and loss function: "trainable hacks" are more robust.

This talk is about "trainable hacks."
[Schematic: training data feeds a prediction device (suitable for the domain), which is trained from feedback. The feedback signal is not just likelihood but loss + runtime.]

Bayesian Decision Theory
θ* = argmin_θ E_{(x,y)~p(x,y)} [ L(y, ŷ_θ(x)) ]
(loss L; data distribution p; prediction rule ŷ_θ; optimized parameters θ* of the prediction rule)
• What prediction rule? (approximate inference + beyond)
• What loss function? (can include runtime)
• How to optimize? (backprop, RL, …)
• What data distribution? (may have to impute)

This talk is about "trainable hacks."
[Schematic: a probabilistic domain model generates complete training data; the prediction device sees only partial data and is trained with loss + runtime feedback (suitable for the domain).]

Part 1: Your favorite approximate inference algorithm is a trainable hack

General CRFs: Unrestricted model structure
[Figure: a loopy CRF over output variables Y1–Y4 with inputs X1–X3.]
Add edges to model the conditional distribution well. But exact inference is intractable, so use loopy sum-product or max-product BP.

General CRFs: Unrestricted model structure
Inference: compute properties of the posterior distribution.
[Figure: marginal tag beliefs for "The cat sat on the mat ." such as DT .9, NN .05, … for "The" and VBD .7, VB .1, … for "sat".]

General CRFs: Unrestricted model structure
Decoding: coming up with predictions from the results of inference.
[Figure: the decoded tag sequence DT NN VBD IN DT NN . for "The cat sat on the mat ."]
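The loopy sum-product BP and belief-based decoding just mentioned can be sketched in a few lines. The following is only an illustrative NumPy sketch of the generic algorithm (synchronous updates on a toy triangle model; all function and variable names are mine), not the implementation used in the ERMA package.

```python
import numpy as np

def loopy_sum_product(unary, edges, pairwise, n_iters=20):
    """Sum-product BP on a pairwise model.
    unary:    {var: length-K array} unnormalized potentials
    edges:    list of (i, j) pairs
    pairwise: {(i, j): KxK array}, rows index x_i, columns index x_j
    Returns per-variable marginal beliefs (approximate on loopy graphs)."""
    nbrs = {v: [] for v in unary}
    for i, j in edges:
        nbrs[i].append(j); nbrs[j].append(i)
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = np.ones(len(unary[j])) / len(unary[j])
        msgs[(j, i)] = np.ones(len(unary[i])) / len(unary[i])
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            pre = unary[i].copy()              # unary times incoming messages, excluding j's
            for k in nbrs[i]:
                if k != j:
                    pre = pre * msgs[(k, i)]
            m = pre @ psi                      # sum over the values of x_i
            new[(i, j)] = m / m.sum()          # normalize for numerical stability
        msgs = new
    beliefs = {}
    for v in unary:
        b = unary[v].copy()
        for k in nbrs[v]:
            b = b * msgs[(k, v)]
        beliefs[v] = b / b.sum()
    return beliefs

# Toy loopy example: 3 binary variables in a triangle, potentials preferring agreement.
unary = {0: np.array([1.0, 2.0]), 1: np.array([1.0, 1.0]), 2: np.array([3.0, 1.0])}
edges = [(0, 1), (1, 2), (0, 2)]
agree = np.array([[2.0, 1.0], [1.0, 2.0]])
beliefs = loopy_sum_product(unary, edges, {e: agree for e in edges})
decoded = {v: int(np.argmax(b)) for v, b in beliefs.items()}   # decode from beliefs
print(beliefs, decoded)
```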
General CRFs: Unrestricted model structure
In practice one uses CRFs with several approximations:
• Approximate inference.
• Approximate decoding.
• Mis-specified model structure.
• MAP training (vs. Bayesian).
These could be present in linear-chain CRFs as well. So why are we still maximizing data likelihood? Our system is really more like a Bayes-inspired neural network that makes predictions.

Train directly to minimize task loss (Stoyanov, Ropson, & Eisner 2011; Stoyanov & Eisner 2012)
[Schematic: x enters (approximate) inference, which produces p(y|x); (approximate) decoding then produces ŷ, which is scored by the loss L(y*,ŷ). The whole pipeline is treated as a black-box decision function parameterized by θ.]
Adjust θ to (locally) minimize training loss, e.g., via back-propagation (+ annealing): "Empirical Risk Minimization under Approximations (ERMA)."

Optimization Criteria
Training methods can be classified along two dimensions: are they loss-aware, and are they approximation-aware?
• MLE: neither loss-aware nor approximation-aware.
• SVMstruct [Finley and Joachims, 2008], M3N [Taskar et al., 2003], Softmax-margin [Gimpel & Smith, 2010]: aware along one of the two dimensions, but not both.
• ERMA: both loss-aware and approximation-aware.

Experimental Results
3 NLP problems; also synthetic data. We show that:
• General CRFs work better when they match dependencies in the data.
• Minimum risk training results in more accurate models.
ERMA software package available at http://www.clsp.jhu.edu/~ves/software

ERMA software package (http://www.clsp.jhu.edu/~ves/software)
• Includes syntax for describing general CRFs.
• Supports sum-product and max-product BP.
• Can optimize several commonly used loss functions: MSE, Accuracy, F-score.
• The package is generic: little effort to model new problems, about 1-3 days to express each problem in our formalism.
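Accuracy and F-score are not differentiable, so ERMA optimizes softened versions (see the deterministic-annealing slide later). Purely as an illustrative sketch of how soft surrogates can be computed from per-variable marginal beliefs (my own simplified formulas, not the package's exact definitions):

```python
import numpy as np

def soft_losses(beliefs, gold):
    """Differentiable surrogates computed from marginal beliefs.
    beliefs: (n, K) array, row i = belief of variable i over K labels
    gold:    length-n integer array of true labels
    Returns (MSE, soft accuracy, soft F1 treating label 1 as 'positive')."""
    n, K = beliefs.shape
    onehot = np.eye(K)[gold]
    mse = np.mean((beliefs - onehot) ** 2)
    soft_acc = np.mean(beliefs[np.arange(n), gold])   # expected accuracy
    pos = beliefs[:, 1]                               # P(label = 1) per variable
    tp = np.sum(pos * (gold == 1))                    # expected true positives
    fp = np.sum(pos * (gold == 0))
    fn = np.sum((1 - pos) * (gold == 1))
    soft_f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
    return mse, soft_acc, soft_f1

beliefs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
gold = np.array([0, 1, 1])
print(soft_losses(beliefs, gold))
```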
Modeling Congressional Votes
The ConVote corpus [Thomas et al., 2006]: floor-debate speeches paired with the speakers' votes.
Example speech (vote: Yea): "First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…"
Another speech with a Yea vote: "Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve …"
• Task: predict representative votes (Y/N) based on debates.
[Figure: two Y/N vote variables, each connected to the text of a speech, and connected to each other by a context link because one speaker mentions the other (Mr. Sensenbrenner).]

Modeling Congressional Votes: Results (Accuracy)
  Non-loopy baseline (2 SVMs + min-cut)                        71.2
  Loopy CRF models (inference via loopy sum-product BP):
    Maximum-likelihood training (with approximate inference)   78.2
    Softmax-margin (loss-aware)                                79.0
    ERMA (loss- and approximation-aware)                       84.5
*Boldfaced results are significantly better than all others (p < 0.05).

Information Extraction from Semi-Structured Text
CMU Seminar Announcement Corpus [Freitag, 2000]. Label fields such as speaker, start time, and location in announcement emails:
  What: Special Seminar
  Who: Prof. Klaus Sutner [speaker]
       Computer Science Department, Stevens Institute of Technology
  Topic: "Teaching Automata Theory by Computer"
  Date: 12-Nov-93
  Time: 12:00 pm [start time]
  Place: WeH 4623 [location]
  Host: Dana Scott (Asst: Rebecca Clark x8-6737)
  ABSTRACT: We will demonstrate the system "automata" that implements finite state machines…
  … After the lecture, Prof. Sutner [speaker] will be glad to demonstrate and discuss the use of MathLink and his "automata" package
Skip-Chain CRF for Information Extraction
Extract speaker, location, stime, and etime fields from seminar announcement emails (CMU Seminar Announcement Corpus [Freitag, 2000]).
Skip-chain CRF [Sutton and McCallum, 2005; Finkel et al., 2005]: a linear chain over the tokens, plus long-distance "skip" edges that connect repeated mentions of the same word (e.g., the two occurrences of "Sutner" in "Who: Prof. Klaus Sutner …" and "… Prof. Sutner will …"), so that their labels (speaker vs. other) can be kept consistent.

Semi-Structured Information Extraction: Results (F1)
  Non-loopy baseline (linear-chain CRF)                          86.2
  Non-loopy baseline + ERMA (trained for loss instead of likelihood)  87.1
  Loopy CRF models (inference via loopy sum-prod BP):
    Maximum-likelihood training (with approximate inference)     89.5
    Softmax-margin (loss-aware)                                  90.2
    ERMA (loss- and approximation-aware)                         90.9
*Boldfaced results are significantly better than all others (p < 0.05).
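As an illustrative sketch only (not the ERMA package's model specification), skip edges of the kind just described might be collected by pairing repeated capitalized tokens; the pairwise factors and feature functions that would sit on those edges are omitted here.

```python
from itertools import combinations

def skip_edges(tokens):
    """Pair up positions of repeated capitalized tokens (e.g., 'Sutner' ... 'Sutner').
    In a skip-chain CRF, each returned pair (i, j) would get a pairwise factor
    encouraging labels y_i and y_j to agree, on top of the linear-chain factors."""
    positions = {}
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions.setdefault(tok, []).append(i)
    edges = []
    for tok, idxs in positions.items():
        edges.extend(combinations(idxs, 2))
    return edges

tokens = "Who : Prof. Klaus Sutner ... After the lecture , Prof. Sutner will ...".split()
print(skip_edges(tokens))   # pairs the two 'Sutner' (and the two 'Prof.') positions
```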
Collective Multi-Label Classification
Assign multiple topic labels to each document (Reuters Corpus Version 2 [Lewis et al., 2004]) and model the labels jointly [Ghamrawi and McCallum, 2005; Finley and Joachims, 2008].
Example document: "The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later. …"  Candidate labels include Oil, Libya, Sports.

Multi-Label Classification: Results (F1)
  Non-loopy baseline (logistic regression for each label)       81.6
  Loopy CRF models (inference via loopy sum-prod BP):
    Maximum-likelihood training (with approximate inference)    84.0
    Softmax-margin (loss-aware)                                 83.8
    ERMA (loss- and approximation-aware)                        84.6
*Boldfaced results are significantly better than all others (p < 0.05).

Summary
                                    Congressional Vote    Semi-str. Inf.    Multi-label
                                    Modeling (Accuracy)   Extraction (F1)   Classification (F1)
  Non-loopy baseline                      71.2                 87.1              81.6
  Loopy CRF models:
    Maximum-likelihood training           78.2                 89.5              84.0
    ERMA                                  84.5                 90.9              84.6

Synthetic Data
• Generate a CRF at random (random structure & parameters).
• Use Gibbs sampling to generate data; forget the parameters; optionally add noise to the structure.
• Learn the parameters from the sampled data; evaluate using one of four loss functions.
• Total of 12 models of different size/connectivity.

Synthetic Data: Results (Δ loss compared to the true model; lower is better)
  Test loss    Trained on ApprLogL (baseline)    Trained on the test loss    wins/ties/losses (over 12 models)
  MSE                  .71                               .05                       12/0/0
  Accuracy             .75                               .01                       11/0/1
  F-Score             1.17                               .08                       10/2/0

Introducing Structure Mismatch
[Figure: test loss as the amount of structure mismatch grows from 10% to 40%, comparing training objectives ApprLogL vs. MSE (evaluated on MSE) and ApprLogL vs. F-score (evaluated on F-score).]

Back-Propagation of Error for Empirical Risk Minimization (independently done by Domke 2010, 2011)
• Back-propagation of error (automatic differentiation in reverse mode) to compute gradients of the loss with respect to θ.
• A gradient-based local optimization method to find the θ* that (locally) minimizes the training loss.
[Schematic: x enters a black-box decision function parameterized by θ, whose output ŷ is scored by L(y*,ŷ). The black box could be a neural network, or, here, the CRF system itself (the loopy model over Y1–Y4 and X1–X3, run through approximate inference and decoding).]
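The "black box" view above invites training by reverse-mode automatic differentiation through the unrolled inference procedure. Below is a minimal, hypothetical PyTorch sketch of that idea: my own toy triangle model, a fixed number of unrolled BP iterations, and a soft-accuracy loss standing in for ERMA's actual losses, annealing, and second-order optimizer (described in the following slides).

```python
import torch

# Unroll T iterations of loopy sum-product BP as a differentiable function of theta,
# score the resulting beliefs with a soft loss, and back-propagate to theta.
edges = [(0, 1), (1, 2), (0, 2)]
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
theta_unary = torch.zeros(3, 2, requires_grad=True)   # log unary potentials
theta_pair = torch.zeros(2, 2, requires_grad=True)    # shared log pairwise potential

def beliefs_from_unrolled_bp(T=5):
    unary = torch.exp(theta_unary)
    psi = torch.exp((theta_pair + theta_pair.t()) / 2)          # symmetric by construction
    msgs = {(i, j): torch.ones(2) for i in nbrs for j in nbrs[i]}
    for _ in range(T):
        new = {}
        for (i, j) in msgs:
            pre = unary[i].clone()
            for k in nbrs[i]:
                if k != j:
                    pre = pre * msgs[(k, i)]
            m = pre @ psi
            new[(i, j)] = m / m.sum()
        msgs = new
    b = []
    for v in nbrs:
        bv = unary[v].clone()
        for k in nbrs[v]:
            bv = bv * msgs[(k, v)]
        b.append(bv / bv.sum())
    return torch.stack(b)                                        # (3, 2) marginal beliefs

gold = torch.tensor([0, 1, 1])
opt = torch.optim.SGD([theta_unary, theta_pair], lr=0.5)
for step in range(50):
    opt.zero_grad()
    b = beliefs_from_unrolled_bp()
    loss = 1.0 - b[torch.arange(3), gold].mean()   # 1 minus expected accuracy
    loss.backward()                                 # reverse-mode autodiff through BP
    opt.step()
print(beliefs_from_unrolled_bp().detach())
```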
Error Back-Propagation
[Figure: the unrolled computation graph of inference. The parameters θ feed into messages, e.g., m(y1→y2) is computed from m(y3→y1) and m(y4→y1), and the messages feed into output beliefs such as P(VoteReidbill77 = Yea | x).]

Error Back-Propagation
• Apply the differentiation chain rule over and over.
• Forward pass: regular computation (inference + decoding) in the model, remembering intermediate quantities.
• Backward pass: replay the forward pass in reverse, computing gradients.

The Forward Pass
Run inference and decoding: θ → Inference (loopy BP) → messages → beliefs → Decoding → output → Loss → L.

The Backward Pass
Replay the computation backward, calculating gradients (adjoints): ð(L) = 1, then ð(output), ð(beliefs), ð(messages), ð(θ), where ð(f) = ∂L/∂f.

Gradient-Based Optimization
• Use a local optimizer to find the θ* that minimizes training loss.
• In practice, we use a second-order method, Stochastic Meta-Descent (Schraudolph 1999). Some more automatic-differentiation magic is needed to compute vector-Hessian products (Pearlmutter 1994).
• Both the gradient and the vector-Hessian computation have the same complexity as the forward pass (up to a small constant factor).

Deterministic Annealing
• Some loss functions are not differentiable (e.g., accuracy).
• Some inference methods are not differentiable (e.g., max-product BP).
• Replace max with softmax and anneal.

Part 1: Your favorite approximate inference algorithm is a trainable hack.
Part 2: What other trainable inference devices can we devise?
[Schematic: a prediction device (suitable for the domain), preferably one whose speed-accuracy tradeoff can be tuned (Horvitz 1989, "flexible computation").]

1. Lookup Methods
• Hash tables; memory-based learning.
• Dual-path models (look up if possible, else do deeper inference); in general, dynamic mixtures of policies (Halpern & Pass 2010).

2. Choose Fast Model Structure
• Static choice of fast model structure (Sebastiani & Ramoni 1998): learning a low-treewidth model (e.g., Bach & Jordan 2001, Narasimhan & Bilmes 2004); learning a sparse model (e.g., Lee et al. 2007); learning an approximate arithmetic circuit (Lowd & Domingos 2010).
• Dynamic choice of fast model structure: dynamic feature selection (Dulac-Arnold et al., 2011; Busa-Fekete et al., 2012; He et al., 2012; Stoyanov & Eisner, 2012); evidence-specific tree (Chechetka & Guestrin 2010); data-dependent convex optimization problem (Domke 2008, 2012).

3. Pruning Unlikely Hypotheses
• Tune the aggressiveness of pruning: pipelines, cascades, beam-width selection.
• Classifiers or competitive thresholds, e.g., Taskar & Weiss 2010, Bodenstab et al. 2011.

4. Pruning Work During Search
• Early stopping of message-passing inference (Stoyanov et al. 2011).

ERMA: Increasing Speed by Early Stopping (synthetic data)
[Figure: test loss vs. the maximum number of BP iterations (100, 30, 20, 10), comparing training objectives ApprLogL vs. MSE (evaluated on MSE) and ApprLogL vs. F-score (evaluated on F-score).]
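A minimal sketch of the speed-accuracy knob that the figure above varies: pick the message-passing budget that minimizes dev loss plus a runtime penalty. The objective, the hypothetical `run_inference(x, max_iters)` callable, and the parameter names are my own, not the paper's training procedure.

```python
import time
import numpy as np

def dev_score_at_budget(max_iters, run_inference, dev_data, lam=1e-3):
    """Score a BP iteration budget by (average dev loss) + lam * (average runtime).
    run_inference(x, max_iters) -> (n, K) beliefs; loss here is 1 - expected accuracy."""
    losses, times = [], []
    for x, gold in dev_data:
        t0 = time.perf_counter()
        beliefs = run_inference(x, max_iters)          # e.g., loopy BP, stopped early
        times.append(time.perf_counter() - t0)
        losses.append(1.0 - np.mean(beliefs[np.arange(len(gold)), gold]))
    return np.mean(losses) + lam * np.mean(times)

def pick_budget(run_inference, dev_data, budgets=(100, 30, 20, 10), lam=1e-3):
    scores = {b: dev_score_at_budget(b, run_inference, dev_data, lam) for b in budgets}
    return min(scores, key=scores.get), scores
```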
4. Pruning Work During Search (continued)
• Early stopping before convergence: message-passing inference (Stoyanov et al. 2011); agenda-based dynamic programming (Jiang et al. 2012), i.e., approximate A*!
• Update some messages more often: in generalized BP, some messages are more complex, and the order of messages also affects the convergence rate. Cf. residual BP; cf. flexible arithmetic circuit computation (Filardo & Eisner 2012).
• Coarsen or drop messages selectively (value of computation). Cf. expectation propagation (for likelihood only).

5. Sequential Decisions with Revision
It is common to use sequential decision processes for structured prediction: MaltParser, SEARN, etc.
[Figure: algorithm example (from Joakim Nivre): an arc-eager transition parser applies SHIFT, LEFT-ARC, RIGHT-ARC, and REDUCE actions to build the dependency tree of "Economic news had little effect on financial markets ." with labels such as SBJ, OBJ, NMOD, PMOD, ROOT.]

5. Sequential Decisions with Revision
• Often treated as reinforcement learning: cumulative or delayed reward; try to avoid "contagious" mistakes.
• New opportunity: an enhanced agent that can backtrack and fix errors, the flip side of RL lookahead! (only in a forgiving environment). Sometimes we can observe such agents (in the psych lab). Or the agent can widen its beam and explore in parallel.

Open Questions
• An effective algorithm that dynamically assesses the value of computation.
• Theorems of the following form: if the true model comes from distribution P, then with high probability there exists a fast/accurate policy in the policy space (better yet, find the policy!).
• Effective policy learning methods.

On Policy Learning Methods …
These are basically large reinforcement learning problems, but rather strange ones: often, many equivalent trajectories will get the same answer.
• Search in policy parameter space: policy gradient (doesn't work); direct search (e.g., Nelder-Mead).
• Search in priority space (Eisner & Daumé 2011): policy (→ priorities) → trajectory → reward; need a surrogate objective, like A*.
• Search in trajectory space: SEARN (too slow for some controllers); loss-augmented inference (Chiang et al. 2009; McAllester et al. 2010); response surface methodology (really searches in policy space); integer linear programming.

Part 1: Your favorite approximate inference algorithm is a trainable hack.
Part 2: What other trainable inference devices can we devise?
Part 3: Beyond ERMA to IRMA, from Empirical to Imputed Risk Minimization under Approximations.
Where does p(x, y) come from? [Again the decision-theoretic objective, with its loss, data distribution, prediction rule, and optimized parameters of the prediction rule.]

Science vs. engineering
The split is between "science" (the data distribution) and "engineering" (the prediction rule and its optimized parameters):
• generative vs. discriminative
• training data vs. dev data (Raina et al. 2003)
• unsupervised vs. supervised data (McCallum et al. 2006)
• regularization vs. empirical loss (Lasserre et al. 2006)
• data distribution vs. decision rule (this work; cf. Lacoste-Julien 2011)

Data imputation (Little & Rubin 1987)
We may need to "complete" missing data. What are we given? How do we need to complete it? How do we complete it?

1. Have plenty of inputs; impute outputs
"Model compression / uptraining / structure compilation": train a fast model on the outputs that a slow, accurate model imputes for unlabeled inputs (see the sketch after this list).
• GMM → VQ (Hedelin & Skoglund 2000)
• ensemble → single classifier (Bucila et al. 2006)
• sparse coding → regression or NN (Kavukcuoglu et al., 2008; Jarrett et al., 2009; Gregor & LeCun, 2010)
• CRF or PCFG → local classifiers (Liang, Daumé & Klein 2008)
• latent-variable PCFG → deterministic sequential parser (Petrov et al. 2010)
• sampling instead of 1-best
• [stochastic] local search → regression (Boyan & Moore 2000)
• k-step planning in an MDP → classification or k'-step planning (e.g., rollout in Bertsekas 2005; Ross et al. 2011, DAgger)
• BN → arithmetic circuit (Lowd & Domingos, 2010)
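A generic sketch of the compression/uptraining recipe above. The components are assumptions of mine: any slow but accurate `slow_predict` function, any featurizer, and scikit-learn's logistic regression standing in for the fast model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uptrain(slow_predict, unlabeled_inputs, featurize):
    """Model compression / uptraining: label plentiful unlabeled inputs with a slow,
    accurate model, then fit a fast model to imitate those imputed outputs."""
    imputed = np.array([slow_predict(x) for x in unlabeled_inputs])   # slow, done offline
    X = np.array([featurize(x) for x in unlabeled_inputs])
    return LogisticRegression(max_iter=1000).fit(X, imputed)          # fast at test time

# Toy demo: the "slow model" is a noiseless rule; the fast model learns to imitate it.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(500, 2))
slow_predict = lambda x: int(x[0] + 2 * x[1] > 0)
fast = uptrain(slow_predict, inputs, featurize=lambda x: x)
print(fast.score(inputs, [slow_predict(x) for x in inputs]))   # agreement with the slow model
```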
2. Have good outputs; impute inputs
• Distort the inputs from input-output pairs (Abu-Mostafa 1995). SVMs can be regarded as doing this too!
• Structured prediction: impute possible missing inputs, e.g., impute many Chinese sentences that should translate into each observed English sentence (Li et al., 2011).

3. Insufficient training data to impute well
We assumed that we have a good slow model at training time. But what if we don't? We could sample from the posterior over model parameters as well …

4. Statistical Relational Learning
• We may only have one big incomplete training example! Sample jointly from (model parameters, completions of the data).
• Need a censorship model to mask data plausibly.
• Need a distribution over queries as well: the query is part of the (x, y) pair.
• What model should we use here? Start with a "base MRF" to allow domain-specific inductive bias, but try to respect the marginals we can get good estimates of (some variables are rarely observed; some values are rarely observed).
• We want IRMA → ERMA as we get more and more training data, which needs a high-capacity model to get consistency. Learn an MRF close to the base MRF? Use a GP based on the base MRF?

Summary: Speed-Accuracy Tradeoffs
(Recap.) Inference requires lots of computation; some of it goes to waste and some is actively harmful. Rather than simplifying our models just to fix this, choose the approximations automatically, tuned to the data distribution & loss function: "trainable hacks" are more robust.

Summary: Bayesian Decision Theory
(Recap.) θ* = argmin_θ E_{(x,y)~p(x,y)} [ L(y, ŷ_θ(x)) ]. What prediction rule? (approximate inference + beyond) What loss function? (can include runtime) How to optimize? (backprop, RL, …) What data distribution? (may have to impute)

FIN

Current Collaborators
• Undergrads & junior grad students: Katherine Wu, Jay Feldman, Frank Ferraro, Tim Vieira, Adam Teichert, Michael Paul.
• Mid to senior grad students: Matt Gormley, Nick Andrews, Henry Pao, Wes Filardo, Jason Smith, Ariya Rastrow.
• Faculty: Ves Stoyanov, Ben Van Durme, Mark Dredze, Yanif Ahmad, Hal Daumé (+ 2 UMD students), René Vidal (+ student), and their students.

NLP Tasks
15-20 years of introducing new formalisms, models & algorithms across NLP:
• Parsing: dependency, constituency, categorial, …; deep syntax; grammar induction.
• Word-internal modeling: morphology, phonology, transliteration, named entities.
• Translation: syntax-based (synchronous, quasi-synchronous, training, decoding).
• Miscellaneous: tagging, sentiment, text categorization, topics, coreference, web scraping, …
• Generic algorithms on automata, hypergraphs, graphical models.

Current Guiding Themes
Machine learning + linguistic structure: fashion statistical models that capture good intuitions about various kinds of linguistic structure; develop efficient algorithms to apply these models to data; be generic.
1. Principled Bayesian models of various interesting NLP domains: discover underlying structure with little supervision; requires new learning and inference algorithms.
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI: a declarative programming language that supports modularity, backed by a searchable space of strategies & data structures.
Fast but Principled Reasoning to Analyze Data
• Principled: new models suited to the data + new inference algorithms for those models = draw appropriate conclusions from data.
• Fast prediction: inference algorithms + approximations trained to balance speed & accuracy = 80% of the benefit at 20% of the cost.
• Reusable frameworks for modeling & prediction.

Word-Internal Modeling
Variation in a name within and across languages.
• E step: re-estimate the distribution over all spanning trees. Requires: a corpus model with sequential generation, copying, and mutation.
• M step: re-estimate the name mutation model along likely tree edges. Requires: a trainable parametric model of name mutation.
[Figure: variants of a name connected by copy/mutation edges within and across languages.]

Word-Internal Modeling: Spelling of named entities
The "gazetteer problem" in NER systems: using gazetteer features helps performance on in-gazetteer names but hurts performance on out-of-gazetteer names, because the spelling features essentially do not learn from the in-gazetteer names.
Solution: generate your gazetteer. Treat the gazetteer itself as training data for a generative model of entity names (including spelling features), and include this sub-model within a full NER model (not obvious how, especially for a discriminative NER model). A non-parametric model generates good results, and it can exploit additional gazetteer data, such as town population. The problem & solution extend to other dictionary resources in NLP: acronyms, IM-speak, cognate translations, …

Word-Internal Modeling: Inference over multiple strings
2011 dissertation by Markus Dreyer: organize corpus types into morphological paradigms and infer the missing forms. Cool model, but exact inference is intractable, even undecidable.

String and Sequence Modeling: Optimal inference of strings
Dual decomposition to the rescue? It allows MAP inference in such models, though it wasn't obvious how to infer strings by dual decomposition. We have a message-passing algorithm: if it converges, the answer is guaranteed correct. We have one technique and are working on others. So far, we've applied it to intersecting many automata, e.g., the exact consensus output of ASR or MT systems; it usually converges reasonably quickly, at O(100*n*g) per iteration.

Grammar Induction
Finding the "best" grammar is a horrible optimization problem, even for overly simple definitions of "best." Two new attacks:
• Mathematical programming techniques: branch and bound + Dantzig-Wolfe decomposition over the sentences + stochastic local search.
• Deep learning: "inside" and "outside" strings should depend on each other only through a nonterminal (context-freeness), so CCA should be able to find that nonterminal (spectral learning). But first we need vector representations of inside and outside strings, so use CCA to build up representations recursively (deep learning). (A small CCA sketch appears below.)

Improved Topic Models
Results improve on the state of the art. What can we learn from distributional properties of words? Some words group together into "topics": they tend to co-occur in documents, or have similar syntactic arguments. But are there further hidden variables governing this? Try to get closer to an underlying meaning or discourse space. Future: embed words or phonemes in a structured feature space whose structure must be learned.
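Purely as an illustration of the CCA step mentioned under Grammar Induction above: toy random feature matrices stand in for real inside/outside string representations, and scikit-learn's CCA is an assumed off-the-shelf implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Rows are occurrences of candidate constituents; X holds features of the "inside"
# string and Y holds features of the "outside" context. If inside and outside interact
# only through a low-dimensional nonterminal, CCA's shared projections recover a
# vector that can stand in for that nonterminal.
rng = np.random.default_rng(0)
n, d_in, d_out, k = 1000, 20, 30, 3
latent = rng.normal(size=(n, k))                    # the hidden "nonterminal" signal
X = latent @ rng.normal(size=(k, d_in)) + 0.1 * rng.normal(size=(n, d_in))
Y = latent @ rng.normal(size=(k, d_out)) + 0.1 * rng.normal(size=(n, d_out))

cca = CCA(n_components=k).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)                      # canonical (shared) coordinates
corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(k)]
print(np.round(corrs, 3))                           # near 1.0: strong shared structure
```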
Applied NLP Tasks
Results improve on the state of the art. Add more global features to the model; this needs approximate inference, but it's worth it, especially if we train for the approximate-inference condition.
• Within-document coreference: build up properties of the underlying entities (gender, number, animacy, semantic type, head word).
• Sentiment polarity: exploit cross-document references that signal (dis)agreement between two authors.
• Multi-label text categorization: exploit correlations between labels on the same document.
• Information extraction: exploit correlations between labels on faraway words.

Database-Generated Websites
[Figure: a database back-end with tables such as (Post ID, Author): (520, Demon), (521, Ushi); (Author, Title): (Demon, Moderator), (Ushi, Pink Space Monkey); (Author, Location): (Demon, Pennsylvania), (Ushi, Where else?). The web-page code is produced by querying the DB.]

Website-Generated Databases* (*Thanks, Bayes!)
Given the web pages, recover the database: we state a prior over annotated grammars, a prior over database schemas, and a prior over database contents.
[Figure: the same relational tables recovered from the web pages.]

Relational database ↔ webpages
Why isn't this easy? We could write a custom script … for every website in every language?? (and maintain it??)
Why are database-backed websites important?
1. Vast amounts (most?) of useful information are published this way.
2. In 2007, the Dark Web project @ U. Arizona estimated 50,000 extremist/terrorist websites; the fastest growth was in Web 2.0 sites. Some were transient sites, or subcommunities on larger sites.
3. Our techniques could extend to analyze other semi-structured docs.
Why are NLP methods relevant? Like natural language, these webpages are meant to be read by humans, but they're a mixture of NL text, tables, semi-structured data, and repetitive formatting. We can harvest the NL text + direct facts (including background facts for NLP). It is helpful that HTML is a tree: we know about those.

Examples of database-backed websites: shopping & auctions (with user comments); news articles & blogs, with user comments; crime listings; social networking; collaborative encyclopedias; linguistic resources (monolingual, bilingual); classified ads; catalogs; public records (in some countries), e.g., real estate, car ownership, sex offenders, legal judgments, inmate data, death records, census data, map data, genealogy, elected officials, licensed professionals (http://www.publicrecordcenter.com); directories of organizations (e.g., Yellow Pages: "Banks of the World >> South Africa >> Union Bank of Nigeria PLC"); directories of people; pages with different types of structured fields (explicit fields, fields with internal structure, iterated fields); forums and bulletin boards, with lots of structured & unstructured content (author, title such as "moderator" or "member", date of post, post body, geographic location of poster).
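As a generic, heavily simplified sketch of spotting the repetitive structure such pages contain: a hand-rolled heuristic using only the standard library, not the Bayesian grammar/schema/contents model described above. The idea is that root-to-text tag paths that repeat across a page often correspond to columns of the underlying database.

```python
from collections import Counter
from html.parser import HTMLParser

class PathCounter(HTMLParser):
    """Count root-to-text tag paths (tag plus class attribute). Paths that repeat
    many times are likely database-generated record fields (e.g., one per post)."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths, self.texts = [], Counter(), {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(f"{tag}.{dict(attrs).get('class', '')}")

    def handle_endtag(self, tag):
        while self.stack:                            # pop back through the matching open tag
            if self.stack.pop().split(".")[0] == tag:
                break

    def handle_data(self, data):
        if data.strip():
            path = "/".join(self.stack)
            self.paths[path] += 1
            self.texts.setdefault(path, []).append(data.strip())

html = """
<table>
  <tr><td class="author">Demon</td><td class="title">Moderator</td></tr>
  <tr><td class="author">Ushi</td><td class="title">Pink Space Monkey</td></tr>
</table>
"""
parser = PathCounter()
parser.feed(html)
for path, count in parser.paths.most_common():
    if count > 1:                                    # a repeated path suggests a table column
        print(path, "->", parser.texts[path])
```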
Fast but Principled Reasoning to Analyze Data
(Recap.) Principled: new models suited to the data + new inference algorithms = draw appropriate conclusions from data. Fast prediction: inference algorithms + approximations trained to balance speed & accuracy = 80% of the benefit at 20% of the cost. Reusable frameworks for modeling & prediction.

ERMA: Empirical Risk Minimization under Approximations
Our pretty models are approximate; our inference procedures are also approximate; our decoding procedures are also approximate; our training procedures are also approximate (non-Bayesian). So let's train to minimize loss in the presence of all these approximations. Striking improvements on several real NLP tasks (as well as a range of synthetic data).

Speed-Aware ERMA
Even better, let's train to minimize loss + runtime. This needs special parameters to control the degree of approximation: how long to run? which messages to pass? which features to use? We get substantial speedups at little cost to accuracy.
Next extension: probabilistic relational models. Learn to do fast approximate probabilistic reasoning about slots and fillers in a knowledge base; detect interesting facts, answer queries, improve information extraction; generate plausible supervised training data, i.e., minimize imputed risk.

Learned Dynamic Prioritization
More minimization of loss + runtime. Many inference procedures take nondeterministic steps that refine current beliefs:
• Graphical models: which message to update next?
• Parsing: which constituent to extend next? Which build action, or should we backtrack & revise? Should we prune, or narrow or widen the beam?
• Coreference: which clusters to merge next?
Learn a fast policy that decides what step to take next: "compile" a slow inference procedure into a fast one that is tailored to the specific data distribution and task loss. It is a hard training problem in order to make test time fast; we're trying a bunch of different techniques.

Compressed Learning (sublinear time)
How do we do unsupervised learning on many terabytes of data?? We can't afford many passes over the dataset. Throw away some data? That might create bias: how do we know we're not throwing away the important clues? Better: summarize the less relevant data and try to learn from the summary. The Google N-gram corpus = a compressed version of the English web: N-gram counts from 1 trillion words of text.

Tagging isolated N-grams
[Figure: two candidate tag sequences for the isolated 5-gram "though most monitor lizards from", IN DT NN NNS IN vs. IN NN VB NNS IN, i.e., "monitor" as noun vs. verb, plus topic indicators (Computers, Biology).]
Oops, ambiguous. For learning, it would help to have the whole sentence.

Tagging N-grams in context
[Figure: the same 5-gram embedded in two different imagined sentence contexts, "… some will eat vegetables though most monitor lizards from Africa are carnivores …" vs. "… he watches them up close though most monitor lizards from a distance …", which disambiguate the tagging.]
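The next slides extrapolate full-sentence contexts for each n-gram and weight them by corpus counts. As a minimal, generic illustration of the count-weighting part only, here is a toy that accumulates tag-word statistics from (tagged n-gram type, count) pairs; the hypothesized tags and the counts in the demo are invented.

```python
from collections import Counter

def count_weighted_stats(tagged_ngrams):
    """Accumulate emission statistics from n-gram *types*, each weighted by its
    corpus count, instead of iterating over every token of a huge corpus."""
    emit = Counter()
    for words, tags, count in tagged_ngrams:
        for w, t in zip(words, tags):
            emit[(t, w)] += count
    return emit

# (n-gram type, hypothesized tags, corpus count) -- toy numbers
data = [
    (("most", "monitor", "lizards", "from"), ("JJS", "NN", "NNS", "IN"), 250),
    (("most", "monitor", "lizards", "from"), ("JJS", "VBP", "NNS", "IN"), 12),
]
stats = count_weighted_stats(data)
total = sum(c for (t, w), c in stats.items() if w == "monitor")
print({t: c / total for (t, w), c in stats.items() if w == "monitor"})  # P(tag | "monitor")
```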
Extrapolating contexts …
[Figure: for the n-gram "though most monitor lizards from", imagine the full-sentence contexts it could extend to (e.g., "… some will eat vegetables though most monitor lizards from Africa are carnivores …", "… he watches them up close though most monitor lizards from a distance …"), along with overlapping neighboring n-grams such as "most monitor lizards from Africa", "most monitor lizards from Asia", "most monitor lizards from a", "vegetables though most monitor lizards", "close though most monitor lizards".]

Learning from N-grams
[Figure: the same picture with corpus counts attached to each n-gram type (counts shown include 52, 133, 101, 32, 250, 142, 83, 25 for n-grams such as "though most monitor lizards from", "most monitor lizards from Africa", "most monitor lizards from Asia", …).]

Fast but Principled Reasoning to Analyze Data
(Recap, as above: principled models + inference; fast, trained approximations; reusable frameworks for modeling & prediction.)

Dyna: a language for propagating and combining information
Each idea takes a lot of labor to code up. We spend way too much "research" time building the parts that we already knew how to build: coding natural variants on existing models/algorithms, hacking together existing data sources and algorithms, extracting outputs, tuning data structures, file formats, computation order, parallelization.

What's in a knowledge base?
Types; observed facts; derived facts; inference rules (declarative); inference strategies (procedural).

Common architecture?
There's not a single best way to represent uncertainty or combine knowledge. What do numeric "weights" represent in a reasoning system? Probabilities (perhaps approximations or bounds); intermediate quantities used to compute probabilities (in dynamic programming or message passing); feature values; potentials; feature weights & other parameters; priorities; distances, similarities; confidences; margins; activation levels; event or sample counts; regularization terms; losses, risks, rewards; partial derivatives; ...

Common architecture?
There's not a single best way to represent uncertainty or combine knowledge: different underlying probabilistic models, different approximate inference/decision algorithms, depending on domain properties, special structure, and speed needs; heterogeneous data, features, rules, proposal distributions. We need the ability to experiment, extend, and combine. But all of the methods share the same computational needs.

Common architecture?
All of the methods need to:
• Store data and permit it to be queried.
• Fuse data: compute derived data using rules.
• Propagate updates to data, parameters, or hypotheses.
• Encapsulate data sources, both input data & analytics.
• Sensitivity analysis (e.g., back-propagation for training).
• Visualization of facts, changes, and provenance.

Common architecture?
2011 paper on encoding AI problems in Dyna:
  2-3 lines: Dijkstra's algorithm
  4 lines:   Feed-forward neural net
  11 lines:  Bigram language model (Good-Turing backoff smoothing)
  6 lines:   Arc-consistency constraint propagation
    +6 lines: with backtracking search
    +6 lines: with branch-and-bound
  6 lines:   Loopy belief propagation
  3 lines:   Probabilistic context-free parsing
    +7 lines: PCFG rule weights via feature templates (toy example)
  4 lines:   Value computation in a Markov Decision Process
  5 lines:   Weighted edit distance
  3 lines:   Markov chain Monte Carlo (toy example)
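Dyna programs are not Python, but as a rough illustration of the kind of computation the Dijkstra entry above declares (agenda-based forward chaining with min-plus combination of item values until a fixed point), here is a hand-rolled solver. The two-rule program paraphrased in the docstring is my summary of the well-known Dyna shortest-path example; the Python itself is my own stand-in, not generated Dyna code.

```python
import heapq

def shortest_paths(edges, start):
    """Agenda-based forward chaining with (min, +) weights: roughly the Python
    analogue of 'path(start) min= 0;  path(B) min= path(A) + edge(A, B).'"""
    best = {}                      # chart: best proved value for each item path(V)
    agenda = [(0.0, start)]        # items whose value may still improve the chart
    while agenda:
        cost, node = heapq.heappop(agenda)
        if node in best:           # already popped with a cost no worse than this one
            continue
        best[node] = cost
        for nbr, w in edges.get(node, []):        # fire the rule for each matching edge
            if nbr not in best:
                heapq.heappush(agenda, (cost + w, nbr))
    return best

edges = {"s": [("a", 1.0), ("b", 4.0)], "a": [("b", 2.0), ("t", 6.0)], "b": [("t", 1.0)]}
print(shortest_paths(edges, "s"))   # {'s': 0.0, 'a': 1.0, 'b': 3.0, 't': 4.0}
```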
Common architecture?
There's not a single best way, but all of the methods share the same needs (store, fuse, propagate, encapsulate, sensitivity analysis, visualization) and benefit from the same optimizations:
• Decide what is worth the time to compute (next).
• Decide where to compute it (parallelism).
• Decide what is worth the space to store (data, memos, indices).
• Decide how to store it.

Common architecture?
Dyna is not a probabilistic database, a graphical-model inference package, FACTORIE, BLOG, Watson, a homebrew evidence-combination system, ... It provides the common infrastructure for these, which is where "all" the implementation effort lies, but it does not commit to any specific data model, probabilistic semantics, or inference strategy.

Summary (again)
Machine learning + linguistic structure: fashion statistical models that capture good intuitions about various kinds of linguistic structure; develop efficient algorithms to apply these models to data; be generic.
1. Principled Bayesian models of various interesting NLP domains: discover underlying structure with little supervision; requires new learning and inference algorithms.
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI: a declarative programming language that supports modularity, backed by a searchable space of strategies & data structures.