Weighted Deduction as a Programming Language
Jason Eisner
Co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Wes Filardo, Wren Thornton
CMU and Google, May 2008  1

An Anecdote from ACL’05 - Michael Jordan
“Just draw a model that actually makes sense for your problem. Just do Gibbs sampling. Um, it’s only 6 lines in Matlab… if it’s fast enough.”  2-3

Conclusions to draw from that talk
1. Mike & his students are great.
2. Graphical models are great (because they’re flexible).
3. Gibbs sampling is great (because it works with nearly any graphical model).
4. Matlab is great (because it frees up Mike and his students to doodle all day and then execute their doodles).  4-5

Systems are big!
Large-scale noisy data, complex models, search approximations, software engineering.
NLP system        files  code (lines)  comments  lang (primary)  purpose
SRILM             308    49879         14083     C++             LM
LingPipe          502    49967         47515     Java            LM/IE
Charniak parser   259    53583         8057      C++             Parsing
Stanford parser   373    121061        24486     Java            Parsing
GenPar            986    77922         12757     C++             Parsing/MT
MOSES             305    42196         6946      Perl, C++, …    MT
GIZA++            124    16116         2575      C++             MT alignment
6

Systems are big!
Maybe a bit smaller outside NLP, but still big and carefully engineered. And they will get bigger, e.g., as machine vision systems do more scene analysis and compositional object modeling.
System     files  code   comments  lang  purpose
ProbCons   15     4442   693       C++   MSA of amino acid seqs
MUSTANG    50     7620   3524      C++   MSA of protein structures
MELISMA    44     7541   1785      C     Music analysis
Dynagraph  218    20246  4505      C++   Graph layout
7

Systems are big!
Large-scale noisy data, complex models, search approximations, software engineering.
Consequences:
Barriers to entry: significant investment to be taken seriously; need to know & implement the standard tricks. Hence a small number of players.
Barriers to experimentation: too painful to tear up and reengineer your old system to try a cute idea of unknown payoff.
Barriers to education and sharing: hard to study or combine systems; potentially general techniques are described and implemented in only one context at a time.  8

How to spend one’s life?
Didn’t I just implement something like this last month?
chart management / indexing
cache-conscious data structures
memory layout, file formats, integerization, …
prioritization of partial solutions (best-first, A*)
lazy k-best, forest reranking
parameter management
inside-outside formulas, gradients, …
different algorithms for training and decoding
conjugate gradient, annealing, ...
parallelization
I thought computers were supposed to automate drudgery.  9

Solution
Presumably, we ought to add another layer of abstraction. After all, this is CS.
Hope to convince you that a substantive new layer exists.
But what would it look like? What’s shared by many programs?  10

Can toolkits help?
NLP tool  files  code    comments  lang    purpose
HTK       111    88865   14429     C       HMM for ASR
OpenFST   150    20502   1180      C++     Weighted FSTs
TIBURON   53     13791   4353      Java    Tree transducers
AGLIB     163    58475   5853      C++     Annotation of time series
UIMA      1577   154547  110183    Java    Unstructured-data mgmt
GATE      1541   79128   42848     Java    Text engineering mgmt
NLTK      258    60661   9093      Python  NLP algs (educational)
libbow    122    42061   9198      C       IR, textcat, etc.
MALLET    559    73859   18525     Java    CRFs and classification
GRMM      90     12584   3286      Java    Graphical models add-on
11

Can toolkits help?
Hmm, there are a lot of toolkits. And they’re big too. Plus, they don’t always cover what you want, which is why people keep writing them.
E.g., I love & use OpenFST and have learned lots from its implementation! But sometimes I also want ...
automata with > 2 tapes
infinite alphabets
parameter training
A* decoding
automatic integerization
automata defined “by policy”
mixed sparse/dense implementation (per state)
parallel execution
hybrid models (90% finite-state)
So what is common across toolkits?  12

The Dyna language
A toolkit’s job is to abstract away the semantics, operations, and algorithms for a particular domain. In contrast, Dyna is domain-independent (like MapReduce, Bigtable, etc.). It manages data & computations that you specify. Toolkits or applications can be built on top.  13

Warning
Lots more beyond this talk. See http://dyna.org to read our papers, download an earlier prototype, sign up for updates by email, and wait for the totally revamped next version.  14

A Quick Sketch of Dyna  15

How you build a system (“big picture” slide)
cool model (e.g., a PCFG)
-> practical equations, e.g., the inside recurrence
   β_x(i,k) += p(N_x → N_y N_z | N_x) · β_y(i,j) · β_z(j,k),  for 0 ≤ i < j < k ≤ n
-> pseudocode (execution order):
   for width from 2 to n
     for i from 0 to n-width
       k = i+width
       for j from i+1 to k-1
         …
-> tuned C++ implementation (data structures, etc.)  16

How you build a system (“big picture” slide)
The Dyna language specifies these equations. Most programs just need to compute some values from other values, and any order is ok: feed-forward! dynamic programming! message passing (including Gibbs)! The system must quickly figure out what influences what, e.g., compute a Markov blanket, or compute transitions in a state machine.  17

How you build a system (“big picture” slide)
Some programs also need to update the outputs if the inputs change: spreadsheets, makefiles, email readers, dynamic graph algorithms, EM and other iterative optimization, the energy of a proposed configuration for MCMC, leave-one-out training of smoothing params.  18

How you build a system (“big picture” slide)
Compilation strategies carry the equations through pseudocode (execution order) down to a tuned C++ implementation (data structures, etc.). We’ll come back to this.  19

Writing equations in Dyna
int a.
a = b * c.
a will be kept up to date if b or c changes. b += x. b += y. equivalent to b = x+y. b is a sum of two variables. Also kept up to date. c += z(1). a “pattern” c += z(2). c += z(N). the capitalized N c += z(3). matches anything c += z(“four”). c is a sum of all c += z(foo(bar,5)). nonzero z(…) values. At compile time, we don’t know how many! 20 More interesting use of patterns a = b * c. scalar multiplication a(I) = b(I) * c(I). pointwise multiplication a += b(I) * c(I). means a = sparse dot product of query & document ... + b(“yetis”)*c(“yetis”) + b(“zebra”)*c(“zebra”) dot product; could be sparse a(I,K) += b(I,J) * c(J,K). b(I)*c(I) I b(I,J)*c(J,K) J matrix multiplication; could be sparse J is free on the right-hand side, so we sum over it 21 Dyna vs. Prolog By now you may see what we’re up to! Prolog has Horn clauses: a(I,K) :- b(I,J) , c(J,K). Dyna has “Horn equations”: a(I,K) += b(I,J) * c(J,K). prove a value for it e.g., a real number, but could be any term Like Prolog: definition from other values b*c only has value when b and c do if no values enter into +=, then a gets no value Allow nested terms Syntactic sugar for lists, etc. Turing-complete Unlike Prolog: Charts, not backtracking! Compile efficient C++ classes Terms have values 22 Some connections and intellectual debts … Deductive parsing schemata (preferably weighted) Deductive databases (preferably with aggregation) Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, … Query optimization Usually limited to decidable fragments, e.g., Datalog Theorem proving Goodman, Nederhof, Pereira, McAllester, Warren, Shieber, Schabes, Sikkel… Theorem provers, term rewriting, etc. Nonmonotonic reasoning Programming languages Increasing interest in resurrecting declarative and logic-based system specifications. Efficient Prologs (Mercury, XSB, …) Probabilistic programming languages (PRISM, IBAL …) Declarative networking (P2) XML processing languages (XTatic, CDuce) Functional logic programming (Curry, …) Self-adjusting computation, adaptive memoization (Acar et al.) 23 Example: CKY and Variations 24 The CKY inside algorithm in Dyna phrase(X,I,J) += rewrite(X,W) * word(W,I,J). phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). goal += phrase(“s”,0,sentence_length). using namespace cky; chart c; put in axioms (values not defined by the above program) theorem pops out c[rewrite(“s”,“np”,“vp”)] = 0.7; c[word(“Pierre”,0,1)] = 1; c[sentence_length] = 30; cin >> c; // get more axioms from stdin cout << c[goal]; // print total weight of all parses 25 Visual debugger: Browse the proof forest desired theorem ambiguity dead end shared substructure (dynamic programming) axioms 26 Visual debugger: Browse the proof forest ambiguity dead end shared substructure (dynamic programming) 27 Parameterization … phrase(X,I,J) phrase(X,I,J) goal += rewrite(X,W) * word(W,I,J). += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). += phrase(“s”,0,sentence_length). rewrite(X,Y,Z) doesn’t have to be an atomic parameter: urewrite(X,Y,Z) *= weight1(X,Y). urewrite(X,Y,Z) *= weight2(X,Z). urewrite(X,Y,Z) *= weight3(Y,Z). urewrite(X,Same,Same) *= weight4. urewrite(X) += urewrite(X,Y,Z). % normalizing constant rewrite(X,Y,Z) = urewrite(X,Y,Z) / urewrite(X). % normalize 28 Related algorithms in Dyna? phrase(X,I,J) phrase(X,I,J) goal += rewrite(X,W) * word(W,I,J). += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). += phrase(“s”,0,sentence_length). Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? 
Log-linear parsing? Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm? 29 Related algorithms in Dyna? phrase(X,I,J) max= += rewrite(X,W) * word(W,I,J). phrase(X,I,J) max= += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). goal += phrase(“s”,0,sentence_length). max= Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm? 30 Related algorithms in Dyna? log+= phrase(X,I,J) max= += rewrite(X,W) +* word(W,I,J). phrase(X,I,J) log+= max= += rewrite(X,Y,Z) +* phrase(Y,I,Mid) *+phrase(Z,Mid,J). goal += phrase(“s”,0,sentence_length). max= log+= Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm? 31 Related algorithms in Dyna? phrase(X,I,J) phrase(X,I,J) goal += rewrite(X,W) * word(W,I,J). += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). += phrase(“s”,0,sentence_length). Viterbi parsing? Logarithmic domain? Lattice parsing? c[ word(“Pierre”, state(5)0, state(9)1) ] = 10.2 Incremental (left-to-right) parsing? Log-linear parsing? 8 9 Lexicalized or synchronous parsing? Binarized CKY? 5 Earley’s algorithm? 32 Related algorithms in Dyna? phrase(X,I,J) phrase(X,I,J) goal += rewrite(X,W) * word(W,I,J). += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). += phrase(“s”,0,sentence_length). Viterbi parsing? Logarithmic domain? Just add words one at a time to the chart Lattice parsing? Check at any time what can Incremental (left-to-right) parsing? be derived from words so far Log-linear parsing? Similarly, dynamic grammars Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm? 33 Related algorithms in Dyna? phrase(X,I,J) phrase(X,I,J) goal += rewrite(X,W) * word(W,I,J). += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). += phrase(“s”,0,sentence_length). Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Again, no change to the Dyna program Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm? 34 Related algorithms in Dyna? phrase(X,I,J) phrase(X,I,J) goal += rewrite(X,W) * word(W,I,J). += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). += phrase(“s”,0,sentence_length). Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm? Basically, just add extra arguments to the terms above 35 Related algorithms in Dyna? phrase(X,I,J) phrase(X,I,J) goal += rewrite(X,W) * word(W,I,J). += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). += phrase(“s”,0,sentence_length). Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm? 36 Rule binarization phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z). folding transformation: asymp. speedup! temp(X\Y,Mid,J) += phrase(Z,Mid,J) * rewrite(X,Y,Z). phrase(X,I,J) += phrase(Y,I,Mid) * temp(X\Y,Mid,J). X Y Z Y I X\Y Z Mid Mid X Y J I Mid Mid J I J 37 Rule binarization phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z). folding transformation: asymp. speedup! temp(X\Y,Mid,J) += phrase(Z,Mid,J) * rewrite(X,Y,Z). phrase(X,I,J) += phrase(Y,I,Mid) * temp(X\Y,Mid,J). 
The three-way join phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z), with free variables Y, Z, Mid, is split into two two-way joins: phrase(Y,I,Mid) (free: Y, Mid) combines with the temp item built from phrase(Z,Mid,J) * rewrite(X,Y,Z). The same trick appears in graphical models, constraint programming, and multi-way database joins.  38

Program transformations
Blatz & Eisner (FG 2007): Lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency (at any stage of the pipeline from model to practical equations to pseudocode to tuned C++ implementation). Many parsing “tricks” can be generalized into automatic transformations that help other programs, too!  39

Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(“s”,0,sentence_length).
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Binarized CKY? Earley’s algorithm?  40

Earley’s algorithm in Dyna
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(“s”,0,sentence_length).
magic templates transformation (as noted by Minnen 1996):
need(“s”,0) = true.
need(Nonterm,J) :- phrase(_/[Nonterm|_],_,J).
phrase(Nonterm/Needed,I,I) += need(Nonterm,I), rewrite(Nonterm,Needed).
phrase(Nonterm/Needed,I,K) += phrase(Nonterm/[W|Needed],I,J) * word(W,J,K).
phrase(Nonterm/Needed,I,K) += phrase(Nonterm/[X|Needed],I,J) * phrase(X/[],J,K).
goal += phrase(“s”/[],0,sentence_length).  41

Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(“s”,0,sentence_length).
Epsilon symbols? word(epsilon,I,I) = 1. (i.e., epsilons are freely available everywhere)  42

Some examples from my lab (as of 2006, w/ prototype)
Programs are very short & easy to change! Easy to try stuff out!
Parsing using …
  factored dependency models (Dreyer, Smith & Smith CONLL’06)
  with annealed risk minimization (Smith & Eisner EMNLP’06)
  constraints on dependency length (Eisner & Smith IWPT’05)
  unsupervised learning of deep transformations (see Eisner EMNLP’02)
  lexicalized algorithms (see Eisner & Satta ACL’99, etc.)
Grammar induction using …
  partial supervision (Dreyer & Eisner EMNLP’06)
  structural annealing (Smith & Eisner ACL’06)
  contrastive estimation (Smith & Eisner GIA’05)
  deterministic annealing (Smith & Eisner ACL’04)
Machine translation using …
  Very large neighborhood search of permutations (Eisner & Tromble NAACL-W’06)
  Loosely syntax-based MT (Smith & Eisner in prep.)
  Synchronous cross-lingual parsing (Smith & Smith EMNLP’04; see also Eisner ACL’03)
Finite-state methods for morphology, phonology, IE, even syntax …
  Unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06)
  Unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05)
  Context-based morphological
disambiguation (Smith, Smith & Tromble EMNLP’05) Trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …) Finite-state machines with very large alphabets (see Eisner ACL’97) Finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03) 43 Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04) Can it express everything in NLP? Remember, it integrates tightly with C++, so you only have to use it where it’s helpful, and write the rest in C++. Small is beautiful. Of course, it is Turing complete … 44 One Execution Strategy (forward chaining) 45 How you build a system (“big picture” slide) cool model PCFG practical equations y (i, j ) z ( j, k ) x i, k pN x N y N z | N x 0i j k n ... Propagate updates pseudocode from right-to-left (execution order) through the equations. for width from 2 to n a.k.a. for i from 0 touse n-width a “agenda algorithm” k = i+widthgeneral “forward chaining” for j from i+1 to k-1 method “bottom-up inference” … “semi-naïve bottom-up” tuned C++ implementation (data structures, etc.) 46 Bottom-up inference agenda of pending updates rules of program pp(I,K) += prep(I,J) s(I,K) +=* np(I,J) np(J,K) * vp(J,K) prep(I,3) pp(2,5) prep(2,3) s(3,9) s(3,7) vp(5,K) vp(5,9) np(3,5) vp(5,7) ? += 0.3 +=== 0.15 0.21 1.0 0.5 ? += 0.3 ==0.7 we updated np(3,5); what else must therefore change? no more matches np(3,5) prep(I,3) vp(5,K) to this query = 0.1+0.3 ? ? 0.4 If np(3,5) hadn’t been in the chart already, we would have added it. chart of derived items with current values 47 How you build a system (“big picture” slide) cool model PCFG practical equations y (i, j ) z ( j, k ) x i, k What’s going on under the hood? pN x N y N z | N x 0i j k n ... pseudocode (execution order) for width from 2 to n for i from 0 to n-width k = i+width for j from i+1 to k-1 … tuned C++ implementation (data structures, etc.) 48 Compiler provides … agenda of pending updates efficient priority queue s(I,K) += np(I,J) * vp(J,K) np(3,5) copy, compare, & hash terms fast, via += 0.3 rules of program hard-coded pattern matching integerization (interning) automatic indexing for O(1) lookup vp(5,K)? chart of derived items with current values efficient storage of terms (given static type info) (implicit storage, “symbiotic” storage, various data structures, support for indices, stack vs. heap, …) 49 Beware double-counting! agenda of pending updates combining with itself rules of program n(I,K) += n(I,J) * n(J,K) n(5,5) n(5,5) to make n(5,5) = 0.2 += 0.3 another copy += ? of itself epsilon constituent n(5,K)? chart of derived items with current values 50 More issues in implementing inference Handling non-distributive updates Replacement what if q(0) becomes unprovable (no value)? p += 1/q(X). adding Δ to q(0) doesn’t simply add to p Backpointers (hyperedges in the derivation forest) p max= q(X). Non-distributive rules what if current max q(0) is reduced? Retraction p max= q(X). Efficient storage, or on-demand recomputation Information flow between f(3), f(int X), f(X) 51 More issues in implementing inference User-defined priorities priority(phrase(X,I,J)) = -(J-I). CKY (narrow to wide) priority(phrase(X,I,J)) = phrase(X,I,J). uniform-cost + heuristic(X,I,J) A* Can we learn a good priority function? (can be dynamic) User-defined parallelization host(phrase(X,I,J)) = J. Can we learn a host choosing function? (can be dynamic) User-defined convergence tests 52 More issues in implementing inference Time-space tradeoffs Which queries to index, and how? 
Selective or temporary memoization Can we learn a policy? On-demand computation (backward chaining) Prioritizing subgoals; query planning Safely invalidating memos` Mixing forward-chaining and backward-chaining Can we choose a good mixed strategy? 53 Parameter training Maximize some objective function. Use Dyna to compute the function. Then how do you differentiate it? … for gradient ascent, conjugate gradient, etc. … gradient of log-partition function also tells us the expected counts for EM objective function as a theorem’s value e.g., inside algorithm computes likelihood of the sentence Two approaches supported: model parameters (andand input sentence) Tape algorithm – remember agenda order run it “backwards.” as axiom values Program transformation – automatically derive the “outside” formulas. 54 Automatic differentiation via the gradient transform a += b * c. g(b) += a * g(c). g(a) += g(b) * c. Now g(x) denotes ∂f/∂x, f being the objective func. Examples: Dyna implementation also supports “tape”based differentiation. Backprop for neural networks Backward algorithm for HMMs and CRFs Outside algorithm for PCFGs 55 More on Program Transformations 56 Program transformations An optimizing compiler would like the freedom to radically rearrange your code. Easier in a declarative language than in C. Don’t need to reconstruct the source program’s intended semantics. Also, source program is much shorter. Search problem (open): Find a good sequence of transformations (on a given workload). 57 Variable elimination via a folding transform Undirected graphical model: = goal max= f1(A,B)*f2(A,C)*f3(A,D)*f4(C,E)*f5(D,E). tempE(C,D) tempE(C,D) max= f4(C,E)*f5(D,E). to eliminate E, join constraints mentioning E, and project E out 59 figure thanks to Rina Dechter Variable elimination via a folding transform Undirected graphical model: = goal max= f1(A,B)*f2(A,C)*f3(A,D)*tempE(C,D). tempD(A,C) tempD(A,C) max= f3(A,D)*tempE(C,D). to eliminate D, join constraints mentioning D, tempE(C,D) max= f4(C,E)*f5(D,E). and project D out 60 figure thanks to Rina Dechter Variable elimination via a folding transform Undirected graphical model: = = goal max= f1(A,B)*f2(A,C)*tempD(A,C). tempC(A) tempC(A) max= f2(A,C)*tempD(A,C). tempD(A,C) max= f3(A,D)*tempE(C,D). tempE(C,D) max= f4(C,E)*f5(D,E). 61 figure thanks to Rina Dechter Variable elimination via a folding transform Undirected graphical model: = = goal max= tempC(A)*f1(A,B). tempB(A) tempB(A) max= f1(A,B). tempC(A) max= f2(A,C)*tempD(A,C). tempD(A,C) max= f3(A,D)*tempE(C,D). tempE(C,D) max= f4(C,E)*f5(D,E). 62 figure thanks to Rina Dechter Variable elimination via a folding transform Undirected graphical model: = = goal max= tempC(A)*tempB(A). could replace max= with += throughout, tempB(A) max= f1(A,B). tempC(A) max= f2(A,C)*tempD(A,C). to compute partition function Z tempD(A,C) max= f3(A,D)*tempE(C,D). tempE(C,D) max= f4(C,E)*f5(D,E). 63 figure thanks to Rina Dechter Grammar specialization as an unfolding transform phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J). rewrite(“s”,“np”,“vp”) += 0.7. unfolding phrase(“s”,I,J) += 0.7 * phrase(“np”,I,Mid) * phrase(“vp”,Mid,J). term flattening s(I,J) += 0.7 * np(I,Mid) * vp(Mid,J). (actually handled implicitly by subtype storage declarations) 64 On-demand computation via a “magic templates” transform a :- b, c. Examples: a :- magic(a), b, c. magic(b) :- magic(a). magic(c) :- magic(a), b. 
Earley’s algorithm for parsing Left-corner filter for parsing On-the-fly composition of FSTs The weighted generalization turns out to be the “generalized A*” algorithm (coarse-to-fine search). 65 Speculation transformation (generalization of folding) Perform some portion of computation speculatively, before we have all the inputs we need Fill those inputs in later Examples from parsing: Gap passing in categorial grammar Transform a parser so that it preprocesses the grammar Build an S/NP (a sentence missing its direct object NP) E.g., unary rule closure or epsilon closure Build phrase(“np”,I,K) from a phrase(“s”,I,K) we don’t have yet (so we haven’t yet chosen a particular I, K) Transform lexical context-free parsing from O(n5) O(n3) Add left children to a constituent we don’t have yet (without committing to its width) Derive Eisner & Satta (1999) algorithm 66 A few more language details So you’ll understand the examples … 67 Terms (generalized from Prolog) These are the “Objects” of the language Primitives: Variables: 3, 3.14159, “myUnicodeString” user-defined primitive types X int X [type-restricted variable; types are tree automata] Compound terms: atom atom(subterm1, subterm2, …) e.g., f(g(h(3),X,Y), Y) Adding support for keyword arguments (similar to R, but must support unification) 68 Fixpoint semantics A Dyna program is a finite rule set that defines a partial function (“map”) ground terms (variable-free) values (also terms) weight(feature(word=“pet”,tag=“Noun”)) northeast(point(3,6)) mother(“Eve”) 2.34 point(4,7) (not defined) Map Map only defines values for ground terms Variables (X,Y,…) let us define values for ∞ly many ground terms Compute a map that satisfies the equations in the program Not guaranteed to halt (Dyna is Turing-complete, unlike Datalog) Not guaranteed to be unique 69 Fixpoint semantics A Dyna program is a finite rule set that defines a partial function (“map”) ground terms (variable-free) values (also terms) weight(feature(word=“pet”,tag=“Noun”)) northeast(point(3,6)) mother(“Eve”) 2.34 point(4,7) (not defined) Map Map only defines values for ground terms Map may accept modifications at runtime Runtime input Adjustments to input (dynamic algorithms) Retraction (remove input), detachment (forget input but preserve output) 70 “Object-oriented” features Maps are terms, i.e., first-class objects weight(feature(word=“pet”,tag=“Noun”)) northeast(point(3,6)) mother(“Eve”) Maps can appear as subterms or as values Useful for encapsulating data and passing it around: fst3 = compose(fst1, fst2). % value of fst3 is a chart forest = parse(sentence). Typed by their public interface: Map 2.34 point(4,7) (not defined) fst4->edge(Q,R) += fst3->edge(R,Q). Maps can be stored in files and loaded from files Human-readable format (looks like a Dyna program) Binary format (mimics in-memory layout) 71 Functional features: Auto-evaluation Terms can have values. So by default, subterms are evaluated in place. Arranged by a simple desugaring transformation: foo( X ) += 3*bar(X). 2 things to evaluate here: bar and * foo( X ) += B is bar(X), Result is 3*B, Result. each “is” pattern-matches against the chart (which conceptually contains pairs such as 49 is bar(7)) Possible to suppress evaluation &f(x) or force it *f(x) Some contexts also suppress evaluation. Variables are replaced with their bindings but not otherwise evaluated. 72 Functional features: Auto-evaluation Terms can have values. So by default, subterms are evaluated in place. 
Arranged by a simple desugaring transformation: foo(f(X)) += 3*bar(g(X)). foo( F ) += F is f(X), G is g(X), B is bar(G), Result is 3*B, Result. Possible to suppress evaluation &f(x) or force it *f(x) Some contexts also suppress evaluation. Variables are replaced with their bindings but not otherwise evaluated. 73 Other handy features Guard condition on a rule: If X is true, then X,Y has value Y. Otherwise X,Y is not provable. fact(0) = 1. fact(int N) = N > 0, N*fact(N-1). Restricts applicability of this rule. (Note: There’s a strong type system, but it’s optional. Use it as desired for safety and efficiency, and to control the implementation.) Degenerate aggregator. Like +=, but it’s an error if it tries to aggregate more than one value. 0! = 1. user-defined syntactic sugar (int N)! = N*(N-1)! if N ≥ 1. Unicode 74 Aggregation operators f(X) = 3. f(X) += 3. f(X) min= 3. f(X) := 3. by further += rules at compile-time or runtime % immutable % can be incremented later % can be reduced later % can be arbitrarily changed later Pseudo-aggregator: Later values (in source code) override earlier ones. Can regard this as a true aggregator on (source line, value) pairs. f(X) => 3. % like = but can be overridden by more specific rule 75 Aggregation operators f(X) := 1. % can be arbitrarily changed later Pseudo-aggregator: Later values override earlier ones. Non-monotonic reasoning: flies(bird X) := true. flies(bird X) := penguin(X), false. % overrides flies(bigbird) := false. % also overrides Iterative update algorithms (EM, Gibbs, BP): a := init_a. a := updated_a(b). % will override once b is proved b := updated_b(a). 76 Declarations (ultimately, should be chosen automatically) at term level lazy vs. eager computational strategies memoization and flushing strategies prioritization, parallelization, etc. at class level class = an implementation of a type type = some subset of the term universe class specifies storage strategy classes may implement overlapping types 77 Frozen variables Dyna map semantics concerns ground terms. But want to be able to reason about non-ground terms, too. Manipulate Dyna rules (which are non-ground terms) Work with classes of ground terms (specified by non-ground terms) Queries, memoized queries … Memoization, updating, prioritization of updates, … So, allow ground terms that contain “frozen variables” Treatment under unification is beyond scope of this talk $priority(f(X)) = f(X). $priority(#f(X)) = infinity. % for each X % frozen non-ground term 78 Gensyms 79 Some More Examples Shortest paths n-gram smoothing Neural nets Arc consistency Vector-space IR Game trees FST composition Edit distance Generalized A* parsing 80 Path-finding in Prolog 1 pathto(1). % the start of all paths pathto(V) :- edge(U,V), pathto(U). When is the query pathto(14) really inefficient? 2 5 8 11 3 6 9 12 4 7 10 13 14 What’s wrong with this swapped version? pathto(V) :- pathto(U), edge(U,V). 81 Shortest paths in Dyna Single source: pathto(start) min= 0. pathto(W) min= pathto(V) + edge(V,W). All pairs: can change min= to += to sum over paths (e.g., PageRank) path(U,U) min= 0. path(U,W) min= path(U,V) + edge(V,W). A* This hint gives Dijkstra’s algorithm (pqueue): $priority(pathto(V) min= Delta) = Delta.+ heuristic(V). Must also declare that pathto(V) has converged as soon as it pops off the priority queue; this is true if heuristic is admissible. 82 Neural networks in Dyna y h1 out(Node) = sigmoid(in(Node)). sigmoid(X) = 1/(1+exp(-X)). 
x1 h3 h2 x2 y' x3 x4 in(Node) += weight(Node,Previous)*out(Previous). in(Node) += input(Node). error += (out(Node)-target(Node))**2. only defined for a few nodes Recurrent neural net is ok 83 Vector-space IR in Dyna bestscore(Query) max= score(Query,Doc). score(Query,Doc) += tf(Query,Word)*tf(Doc,Word)*idf(Word). idf(Word) = 1/log(df(Word)). df(Word) += 1 whenever tf(Doc,Word) > 0. 84 Weighted FST composition in Dyna (epsilon-free case) start(A o B) = start(A) x start(B). stop(A o B, Q x R) += stop (A, Q) & stop (B, R). arc(A o B, Q1 x R1, Q2 x R2, In, Out) += arc(A, Q1, Q2, In, Match) * arc(B, R1, R2, Match, Out). Computes full cross-product. Use magic templates transform to build only reachable states. 85 n-gram smoothing in Dyna These values all update automatically during leave-one-out jackknifing. mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y). smoothed_prob(X,Y,Z) = λ*mle_prob(X,Y,Z) + (1-λ)*mle_prob(Y,Z). for arbitrary-length contexts, could use lists count_of_count(X,Y,count(X,Y,Z)) += 1. Used for Good-Turing and Kneser-Ney smoothing. E.g., count_of_count(“the”, “big”, 1) is number of word types that appeared exactly once after “the big.” 86 Arc consistency (= 2-consistency) Agenda algorithm … X:3 has no support in Y, so kill it off Y:1 has no support in X, so kill it off Z:1 just lost its only support in Y, so kill it off X X, Y, Z, T :: 1..3 X # Y Y #= Z T # Z X #< T 1, 2, 3 Y 1, 2, 3 Note: These steps can occur in somewhat arbitrary order 1, 2, 3 T = 1, 2, 3 Z 87 slide thanks to Rina Dechter (modified) Arc consistency in Dyna (AC-4 algorithm) Axioms (alternatively, could define them by rule): indomain(Var:Val) := … % define some values true consistent(Var:Val, Var2:Val2) := … For Var:Val to be kept, Val must be in-domain and also not ruled out by any Var2 that cares: Define to be true or false if Var, Var2 are co-constrained. Otherwise, leave undefined (or define as true). possible(Var:Val) &= indomain(Var:Val). possible(Var:Val) &= supported(Var:Val, Var2). Var2 cares if it’s co-constrained with Var:Val: supported(Var:Val, Var2) |= consistent(Var:Val, Var2:Val2) & possible(Var2:Val2). 88 Propagating bounds consistency in Dyna E.g., suppose we have a constraint A #<= B (as well as other constraints on A). Then maxval(a) min= maxval(b). % if B’s max is reduced, then A’s should be too minval(b) max= minval(a). % by symmetry Similarly, if C+D #= 10, then maxval(c) min= 10-minval(d). maxval(d) min= 10-minval(c). minval(c) max= 10-maxval(d). minval(d) max= 10-maxval(c). 89 Game-tree analysis All values represent total advantage to player 1 starting at this board. % how good is Board for player 1, if it’s player 1’s move? best(Board) max= stop(player1, Board). best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard). % how good is Board for player 1, if it’s player 2’s move? (player 2 is trying to make player 1 lose: zero-sum game) worst(Board) min= stop(player2, Board). worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard). % How good for player 1 is the starting board? goal = best(Board) if start(Board). 90 Edit distance between two strings 4 edits clara caca Traditional picture clara 3 edits caca 2 clara c aca 3 cla ra c ac a 9 clara caca 91 Edit distance in Dyna dist([], []) = 0. dist([X|Xs],Ys) min= dist(Xs,Ys) + delcost(X). dist(Xs,[Y|Ys]) min= dist(Xs,Ys) + inscost(Y). dist([X|Xs],[Y|Ys]) min= dist(Xs,Ys) + substcost(X,Y). substcost(L,L) = 0. result = align([“c”, “l”, “a”, “r”, “a”], [“c”, “a”, “c”, “a”]). 
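The four Dyna rules above are just the textbook edit-distance recurrence. For comparison, here is a rough Python sketch of the same recurrence as an explicitly memoized function over plain strings (not the lattice generalization on the next slide); the unit-cost delcost/inscost/substcost definitions are stand-ins for whatever costs the program would actually declare.

```python
from functools import lru_cache

# Stand-in costs; the Dyna program would declare delcost(X), inscost(Y),
# substcost(X,Y) with substcost(L,L) = 0.
def delcost(x): return 1
def inscost(y): return 1
def substcost(x, y): return 0 if x == y else 1

def dist(xs, ys):
    """Min edit cost between sequences xs and ys, mirroring the four Dyna rules."""
    @lru_cache(maxsize=None)
    def d(i, j):  # distance between the suffixes xs[i:] and ys[j:]
        if i == len(xs) and j == len(ys):
            return 0                                              # dist([], []) = 0.
        best = float("inf")
        if i < len(xs):
            best = min(best, d(i + 1, j) + delcost(xs[i]))        # delete X
        if j < len(ys):
            best = min(best, d(i, j + 1) + inscost(ys[j]))        # insert Y
        if i < len(xs) and j < len(ys):
            best = min(best, d(i + 1, j + 1) + substcost(xs[i], ys[j]))  # substitute X by Y
        return best
    return d(0, 0)

print(dist(tuple("clara"), tuple("caca")))   # -> 2 with these unit costs
```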
92 Edit distance in Dyna on input lattices dist(S,T) min= dist(S,T,Q,R) + Sfinal(Q) + Tfinal(R). dist(S,T, S->start, T->start) min= 0. dist(S,T, I2, J) min= dist(S,T, I, J) + Sarc(I,I2,X) + delcost(X). dist(S,T, I, J2) min= dist(S,T, I, J) + Tarc(J,J2,Y) + inscost(Y). dist(S,T, I2,J2) min= dist(S,T, I, J) + Sarc(I,I2,X) + Sarc(J,J2,Y) + substcost(X,Y). substcost(L,L) = 0. result = dist(lattice1, lattice2). lattice1 = { start=state(0). arc(state(0),state(1),“c”)=0.3. arc(state(1),state(2),“l”)=0. … final(state(5)). } 93 Generalized A* parsing (CKY) % Get Viterbi outside probabilities. % Isomorphic to automatic differentiation (reverse mode). outside(goal) = 1. outside(Body) max= outside(Head) whenever $rule(Head max= Body). outside(phrase B) max= (*phrase A) * outside(&(A*B)). outside(phrase A) max= outside(&(A*B)) * (*phrase B). % Prioritize by outside estimates from coarsened grammar. $priority(phrase P) = (*P) * outside(coarsen(P)). $priority(phrase P) = 1 if P==coarsen(P). % can't coarsen any further 94 Generalized A* parsing (CKY) % coarsen nonterminals. coa("PluralNoun") = "Noun". coa("Noun") = "Anything". coa("Anything") = "Anything". … % coarsen phrases. coarsen(&phrase(X,I,J)) = &phrase(coa(X),I,J). % make successively coarser grammars % each is an admissible estimate for the next-finer one. coarsen(rewrite(X,Y,Z)) = rewrite(coa(X),coa(Y),coa(Z)). coarsen(rewrite(X,Word)) = rewrite(coa(X),Word). *coarsen(Rule) max= Rule. i.e., Coarse max= Rule whenever Coarse=coarsen(Rule). 95 Lightweight information interchange? Easy for Dyna terms to represent: XML data (Dyna types are analogous to DTDs) RDF triples (semantic web) Annotated corpora Ontologies Graphs, automata, social networks Also provides facilities missing from semantic web: Queries against these data State generalizations (rules, defaults) using variables Aggregate data and draw conclusions Keep track of provenance (backpointers) Keep track of confidence (weights) Map = deductive database in a box Like a spreadsheet, but more powerful, safer to maintain, and can communicate with outside world 96 How fast was the prototype version? It used “one size fits all” strategies Asymptotically optimal, but: 4 times slower than Mark Johnson’s inside-outside 4-11 times slower than Klein & Manning’s Viterbi parser 5-6x speedup not too hard to get 97 Are you going to make it faster? (yup!) Static analysis Mixed storage strategies Mixed inference strategies store X in an array store Y in a hash don’t store Z (compute on demand) Choose strategies by User declarations Automatically by execution profiling 98 Summary AI systems are too hard to write and modify. Dyna is a language for computation (no I/O) Need a new layer of abstraction. Simple, powerful idea: Define values from other values by weighted logic. Produces classes that interface with C++, etc. Compiler supports many implementation strategies Tries to abstract and generalize many tricks Fitting a strategy to the workload is a great opportunity for learning! Natural fit to fine-grained parallelization Natural fit to web services 99 Dyna contributors! Prototype (available): All-new version (under development): Eric Goldlust (core compiler), Noah A. Smith (parameter training), Markus Dreyer (front-end processing), David A. 
Smith, Roy Tromble, Asheesh Laroia Nathaniel Filardo (core compiler), Wren Ng Thornton (core compiler), Jay Van Der Wall (source language parser), John Blatz (transformations and inference), Johnny Graettinger (early design), Eric Northup (early design) Dynasty hypergraph browser (usable): Michael Kornbluh (initial version), Gordon Woodhull (graph layout), Samuel Huang (latest version), George Shafer, Raymond Buse, Constantinos Michael 100 FIN 101 New examples of dynamic programming in NLP Parameterized finite-state machines 111 Parameterized FSMs An FSM whose arc probabilities depend on parameters: they are formulas. a/q /p /1-p a/r b/(1-q)r 1-s a/q*exp(t+u) 112 Parameterized FSMs An FSM whose arc probabilities depend on parameters: they are formulas. a/.2 /.1 /.9 a/.3 b/.8 .7 a/.44 113 Parameterized FSMs An FSM whose arc probabilities depend on parameters: they are formulas. a/q /p /1-p a/r b/(1-q)r a/q*exp(t+u) 1-s Expert first: Construct the FSM (topology & parameterization). Automatic takes over: Given training data, find parameter values that optimize arc probs. 114 Parameterized FSMs Knight & Graehl 1997 - transliteration “/t/ and /d/ are similar …” p(English text) o p(English text English phonemes) o p(English phonemes Japanese phonemes) o p(Japanese phonemes Japanese text) Loosely coupled probabilities: /t/:/tt/ exp p+q+r (coronal, stop, unvoiced) /d/:/dd/ exp p+q+s (coronal, stop, voiced) 115 Parameterized FSMs p(English text) o p(English text English phonemes) o p(English phonemes Japanese phonemes) o p(Japanese phonemes Japanese text) Knight & Graehl 1997 - transliteration “Would like to get some of that expert knowledge in here” Use probabilistic regexps like (a*.7 b) +.5 (ab*.6) … If the probabilities are variables (a*x b) +y (ab*z) … then arc weights of the compiled machine are nasty formulas. (Especially after minimization!) 116 New examples of dynamic programming in NLP Parameterized infinite-state machines 125 Universal grammar as a parameterized FSA over an infinite state space 126 New examples of dynamic programming in NLP More abuses of finite-state machines 127 Huge-alphabet FSAs for OT phonology etc. Gen proposes all candidates that include this input. voi C V C C voi voi C V C C C V C C V C V C C C C C velar C } surface tiers voi voi V underlying tiers C C C V C C voi C V C C 128 Huge-alphabet FSAs for OT phonology encode this candidate as a string voi C V C C velar V C C C C C C at each moment, need to describe what’s going on on many tiers 129 Directional Best Paths construction Keep “best” output string for each input string Yields a new transducer (size 3n) For input abc: abc axc For input abd: axd 1 a:a b:b c:c 3 2 c:c b:x 4 6 Must allow red arc just if next input is d 5 d:d 7 130 Minimization of semiring-weighted FSAs New definition of for pushing: (q) = weight of the shortest path from q, breaking ties alphabetically on input symbols Computation is simple, well-defined, independent of (K, ) Breadth-first search back from final states: q b a:k a c distance 2 Compute (q) in O(1) time as soon as we visit q. Whole alg. is linear. b b c r (q) = k (r) d Faster than finding min-weight path à la Mohri. 131 New examples of dynamic programming in NLP Tree-to-tree alignment 132 Synchronous Tree Substitution Grammar Two training trees, showing a free translation from French to English. 
donnent (“give”) kiss à (“to”) baiser (“kiss”) un (“a”) beaucoup (“lots”) Sam Sam kids often quite d’ (“of”) enfants (“kids”) “beaucoup d’enfants donnent un baiser à Sam” “kids kiss Sam quite often” Synchronous Tree Substitution Grammar Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. donnent (“give”) Start kiss à (“to”) baiser (“kiss”) un (“a”) beaucoup (“lots”) d’ (“of”) Sam Sam NP Adv kids null null NP Adv often quite NP enfants (“kids”) “beaucoup d’enfants donnent un baiser à Sam” “kids kiss Sam quite often” Synchronous Tree Substitution Grammar Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. Alignment shows how trees are generated synchronously from “little trees” ... donnent (“give”) Start kiss à (“to”) baiser (“kiss”) un (“a”) beaucoup (“lots”) d’ (“of”) Sam Sam NP Adv kids null null NP Adv often quite NP enfants (“kids”) “beaucoup d’enfants donnent un baiser à Sam” “kids kiss Sam quite often” New examples of dynamic programming in NLP Bilexical parsing in O(n3) (with Giorgio Satta) 136 Lexicalized CKY [ Mary ] loves [ [ the] girl [ outdoors ] ] 137 Lexicalized CKY is ... advocate ... hug B i h 5 O(n ) not 3 O(n ) visiting relatives visiting relatives C j j+1 h’ k O(n3 combinations) A i h O(n5 combinations) k 138 Idea #1 Combine B with what C? B i h C j i must try differently-headed C’s (vary h’) Separate these! j+1 h’ k A must try different-width C’s (vary k) h’ k 139 Idea #1 B C h’ (the old CKY way) B i h i h j A C j j+1 h’ k C i j h’ k j+1 h’ k A A i h’ C i h’ k 140 Idea #2 Some grammars allow A A i h h k A i k 141 Idea #2 Combine what B and C? B i h C j i must try different midpoints j Separate these! j+1 h’ k A must try different-width C’s (vary k) h’ k 142 Idea #2 B (the old CKY way) B i h i h j+1 h’ k j A C j C i h j+1 h’ C h’ k h’ k A A i h C i h k 143 Idea #2 B (the old CKY way) B i h C j j+1 h’ k A i h k C h j A h j+1 h’ C h’ C h’ k A h k 144 An 3 O(n ) [ Mary ] algorithm (with G. Satta) loves [ [ the] girl [ outdoors ] ] 145 3 parsers: log-log plot 100000 10000 exhaustive 1000 Time pruned NAIVE IWPT-97 ACL-99 NAIVE 100 IWPT-97 ACL-99 10 1 10 100 Sentence Length 146 New examples of dynamic programming in NLP O(n)-time partial parsing by limiting dependency length (with Noah A. Smith) 147 Short-Dependency Preference A word’s dependents (adjuncts, arguments) tend to fall near it in the string. length of a dependency ≈ surface distance 3 1 1 1 fraction of all dependencies 0.6 50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ... 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0 0.1 length 1 10 100 1000 0 0 50 English 100 150 Chinese 200 German 250 Related Ideas • Score parses based on what’s between a head and child (Collins, 1997; Zeman, 2004; McDonald et al., 2005) • Assume short → faster human processing (Church, 1980; Gibson, 1998) • “Attach low” heuristic for PPs (English) (Frazier, 1979; Hobbs and Bear, 1990) • Obligatory and optional re-orderings (English) (see paper) Going to Extremes Longer dependencies are less likely. 0.6 What if we eliminate them completely? 0.5 0.4 0.3 0.2 0.1 0 1 10 English 100 Chinese 1000 German Hard Constraints Disallow dependencies between words of distance > b ... Risk: best parse contrived, or no parse at all! Solution: allow fragments (partial parsing; Hindle, 1990 inter alia). Why not model the sequence of fragments? 
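As a small illustration of the statistic being discussed (not code from the paper): given each word's head index, dependency length is just surface distance, and a hard bound b simply disallows the long dependencies, detaching those words into fragments. The toy sentence reuses the earlier example, but its head assignments below are made up for the illustration.

```python
# Illustrative sketch: dependency length = surface distance between a word
# and its head. heads[i] is the index of word i's parent, or None for a
# root word attached to the wall symbol $.
def dependency_lengths(heads):
    return [abs(i - h) for i, h in enumerate(heads) if h is not None]

def violations(heads, b):
    """Dependencies that a hard length bound b would disallow (forcing fragments)."""
    return [(i, h) for i, h in enumerate(heads)
            if h is not None and abs(i - h) > b]

# Toy example (made-up analysis): "kids kiss Sam quite often", rooted at "kiss".
words = ["kids", "kiss", "Sam", "quite", "often"]
heads = [1, None, 1, 4, 1]        # kids<-kiss, Sam<-kiss, quite<-often, often<-kiss
print(dependency_lengths(heads))  # [1, 1, 1, 3]
print(violations(heads, b=2))     # [(4, 1)]: the length-3 link from "often" to "kiss"
```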
Building a Vine SBG Parser Grammar: generates sequence of trees from $ Parser: recognizes sequences of trees without long dependencies Need to modify training data so the model is consistent with the parser. $ 8 According 1 4 1 , changes 2 to 2 the estimates 9 would 1 rule 1 . cut 3 2 filings 1 insider by 1 more 1 some 1 than 2 third (from the Penn Treebank) a 1 $ would 4 According 1 1 , changes 2 to 2 the 1 rule estimates 1 . cut 3 2 filings 1 insider by 1 more 1 some 1 b=4 than 2 third (from the Penn Treebank) a 1 $ would According 1 1 , changes 2 to 2 the 1 rule estimates 1 . cut 3 2 filings 1 insider by 1 more 1 some 1 b=3 than 2 third (from the Penn Treebank) a 1 $ would According 1 1 , changes 2 to 2 the 1 rule estimates 1 . cut 2 filings 1 insider by 1 more 1 some 1 b=2 than 2 third (from the Penn Treebank) a 1 $ would According 1 1 , changes to 1 . cut 1 the rule estimates filings 1 insider by 1 more 1 some 1 b=1 than third (from the Penn Treebank) a 1 $ would According . , changes to the filings by rule estimates some cut insider more b=0 than third (from the Penn Treebank) a Vine Grammar is Regular • Even for small b, “bunches” can grow to arbitrary size: • But arbitrary center embedding is out: Limiting dependency length Linear-time partial parsing: Finite-state model of root sequence NP S NP Bounded dependency length within each chunk (but chunk could be arbitrarily wide: right- or left- branching) Natural-language dependencies tend to be short So even if you don’t have enough data to model what the heads are … … you might want to keep track of where they are. 164 Limiting dependency length Linear-time partial parsing: Finite-state model of root sequence NP S Don’t convert into an FSA! NP Bounded dependency length within each chunk (but chunk could be arbitrarily wide: right- or left- branching) Less structure sharing Explosion of states for different stack configurations Hard to get your parse back 165 Limiting dependency length Linear-time partial parsing: NP one dependency S NP Each piece is at most k words wide No dependencies between pieces Finite state model of sequence Linear time! O(k2n) 166 Limiting dependency length Linear-time partial parsing: NP S NP Each piece is at most k words wide No dependencies between pieces Finite state model of sequence Linear time! O(k2n) 167 Parsing Algorithm • Same grammar constant as Eisner and Satta (1999) • O(n3) → O(nb2) runtime • Includes some overhead (low-order term) for constructing the vine – Reality check ... is it worth it? F-measure & runtime of a limited-dependency-length parser (POS seqs) 171 Precision & recall of a limited-dependency-length parser (POS seqs) 172 New examples of dynamic programming in NLP Grammar induction by initially limiting dependency length (with Noah A. Smith) 187 Soft bias toward short dependencies Multiply parse probability by exp -δS where S is the total length of all dependencies Then renormalize probabilities MLE baseline -∞ linear structure preferred δ=0 +∞ 189 Structural Annealing MLE baseline δ=0 -∞ +∞ Repeat ... Increase δ and retrain. Start here; train a model. Until performance stops improving on a small validation dataset. 190 20 German 50.3 63.4 70.0 English 41.6 57.4 61.8 Bulgarian 45.6 40.5 58.4 Mandarin 50.1 41.1 56.4 Turkish 48.0 58.2 62.4 Portuguese 42.3 71.8 50.4 30 40 50 60 70 Grammar Induction Other structural biases MLE can be annealed. CE: Deletions & Transpositions Structural Annealing We tried annealing on connectivity (# of fragments), and got similar results. 
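A schematic Python sketch of the structural-annealing loop just described, assuming hypothetical train_model and validation_accuracy procedures (standing in for EM training with the exp(-δ·S) locality bias and for scoring on a small validation set); the starting value and step size for δ below are illustrative, not the paper's settings.

```python
# Schematic sketch of structural annealing on the locality-bias parameter delta.
# train_model(delta, init) and validation_accuracy(model) are hypothetical
# stand-ins, not part of any real library here.
def structural_annealing(train_model, validation_accuracy,
                         delta_start=-1.0, delta_step=0.1, delta_max=0.0):
    delta = delta_start                            # strong bias toward short dependencies
    model = train_model(delta, init=None)          # "start here; train a model"
    best_model, best_acc = model, validation_accuracy(model)
    while delta < delta_max:
        delta += delta_step                        # relax the locality bias ...
        model = train_model(delta, init=model)     # ... and retrain from the previous model
        acc = validation_accuracy(model)
        if acc <= best_acc:                        # stop once validation accuracy stops improving
            break
        best_model, best_acc = model, acc
    return best_model
```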
191 Treebank: the gene 6 A /9-Accurate Parse These errors thus can look like ones made by a prevent supervised plant from fertilizing parser in 2000! a itself MLE with locality bias: verb instead of modal as root the preposition misattachment prevent gene thus can plant a from fertilizing itself misattachment of adverb “thus” 192 Accuracy Improvements random Klein & language Manning tree (2004) German Smith & Eisner (2006) state-ofthe-art, supervised 27.5% 50.3 70.0 82.61 English 30.3 41.6 61.8 90.92 Bulgarian 30.4 45.6 58.4 85.91 Mandarin 22.6 50.1 57.2 84.61 Turkish 29.8 48.0 62.4 69.61 Portuguese 30.6 42.3 71.8 86.51 1CoNLL-X shared task, best system. 2McDonald 193 et al., 2005 Combining with Contrastive Estimation This generally gives us our best results … 194 New examples of dynamic programming in NLP Contrastive estimation for HMM and grammar induction Uses lattice parsing … (with Noah A. Smith) 195 Contrastive Estimation: Training Log-Linear Models on Unlabeled Data Noah A. Smith and Jason Eisner Department of Computer Science / Center for Language and Speech Processing Johns Hopkins University {nasmith,jason}@cs.jhu.edu Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data Noah A. Smith and Jason Eisner Department of Computer Science / Center for Language and Speech Processing Johns Hopkins University {nasmith,jason}@cs.jhu.edu Nutshell Version unannotated text tractable training contrastive estimation with lattice neighborhoods Experiments on unlabeled data: “max ent” features sequence models POS tagging: 46% error rate reduction (relative to EM) “Max ent” features make it possible to survive damage to tag dictionary Dependency parsing: 21% attachment error reduction (relative to EM) “Red leaves don’t hide blue jays.” Maximum Likelihood Estimation (Supervised) p JJ NNS MD VB JJ NNS y red leaves don’t hide blue jays x p Σ* × Λ* ? ? * Maximum Likelihood Estimation (Unsupervised) p ? ? ? ? ? ? red leaves don’t hide blue jays This is what EM does. Σ* × Λ* p ? ? * x Focusing Probability Mass numerator denominator Conditional Estimation (Supervised) p p JJ NNS MD VB JJ NNS y red leaves don’t hide blue jays x ? ? ? ? ? ? red leaves hide blue jays A different don’t denominator! (x) × Λ* Objective Functions Objective MLE MLE with hidden Contrastive variables Estimation Conditional Likelihood Perceptron Optimization Algorithm Count & Normalize* generic numerical EM* solvers (inIterative this talk, LMVM Scaling L-BFGS) Backprop Numerator Denominator tags & words observed data Σ* × Λ* (in words this talk, raw word sequence, tags words sum&over all possible taggings) tags & words Σ* × Λ* ? (words) × Λ* hypothesized tags & words *For generative models. This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability. Language Learning (Syntax) red leaves don’t hide blue jays At last! My own language learning device! EM didn’t hesequence say, Why didWhy he pick that for those words? granola” “birds fly” or “dancing “the dishes” or or Whyornot saywash “leaves red ...” hidesequence don’t ...”oforwords? ... any“... other What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood. 
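For concreteness, a brute-force Python sketch of that objective, assuming the neighborhood and all taggings are small enough to enumerate explicitly; the real system instead represents each neighborhood as a lattice and uses dynamic programming. The score function is a stand-in for the log-linear score Σ θ·f(x,y).

```python
import math
from itertools import product

def log_sum_exp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def ce_log_likelihood(sentence, neighborhood, tagset, score):
    """log sum_y exp score(x,y)  minus  log sum_{x' in N(x)} sum_y exp score(x',y)."""
    def log_z(x):  # sum over all taggings y of the word sequence x
        return log_sum_exp([score(x, y) for y in product(tagset, repeat=len(x))])
    # The neighborhood should contain the observed sentence itself.
    return log_z(sentence) - log_sum_exp([log_z(xp) for xp in neighborhood])

# Toy usage with a dummy scorer that simply prefers the observed word order.
obs = ("red", "leaves")
nbhd = [("red", "leaves"), ("leaves", "red")]      # a TRANS1-style neighborhood of obs

def dummy_score(x, y):
    return 2.0 if x == obs else 0.0

print(ce_log_likelihood(obs, nbhd, tagset=("JJ", "NNS"), score=dummy_score))  # ~ -0.127
```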
The Job of Syntax “Explain why each word is necessary.” → DEL1WORD neighborhood red don’t hide blue jays leaves don’t hide blue jays red leaves hide blue jays red leaves don’t hide blue jays red leaves don’t hide blue red leaves don’t blue jays red leaves don’t hide jays The Job of Syntax “Explain the (local) order of the words.” → TRANS1 neighborhood red don’t leaves hide blue jays leaves red don’t hide blue jays red leaves don’t hide blue jays red leaves hide don’t blue jays red leaves don’t hide jays blue red leaves don’t blue hide jays p p ? ? ? ? ? ? red leaves don’t hide blue jays ? ? ? ? ? ? red leaves don’t hide blue jays ? ? ? ? ? ? leaves red don’t hide blue jays ? ? ? ? ? ? red don’t leaves hide blue jays ? ? ? ? ? ? red leaves hide don’t blue jays ? ? ? ? ? ? red leaves don’t blue hide jays ? ? ? ? ? ? red leaves don’t hide jays blue sentences in TRANS1 neighborhood p ? ? ? ? ? ? red leaves don’t hide blue jays leaves don’t red hide www.dyna.org p blue jays blue jays (shameless self promotion) don’t (with any tagging) hide sentences in TRANS1 neighborhood The New Modeling Imperative A good sentence hints that a set of bad ones is nearby. numerator denominator (“neighborhood”) “Make the good sentence likely, at the expense of those bad neighbors.” This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability. Log-Linear Models score of x, y partition function Computing Z is undesirable! Conditional Estimation (Supervised) 1 sentence Contrastive Estimation (Unsupervised) a few sentences Sums over all possible taggings of all possible sentences! A Big Picture: Sequence Model Estimation unannotated data tractable sums generative, EM: p(x) log-linear, EM: p(x) log-linear, MLE: p(x, y) generative, MLE: p(x, y) log-linear, conditional estimation: p(y | x) overlapping features log-linear, CE with lattice neighborhoods Contrastive Neighborhoods • Guide the learner toward models that do what syntax is supposed to do. • Lattice representation → efficient algorithms. There is an art to choosing neighborhood functions. Neighborhoods neighborhood size lattice arcs DEL1WORD n+1 O(n) delete up to 1 word TRANS1 n O(n) transpose any bigram DELORTRANS1 O(n) O(n) perturbations DEL1WORD TRANS1 DEL1SUBSEQUENCE O(n2) O(n2) delete any contiguous subsequence Σ* (EM) ∞ - replace each word with anything The Merialdo (1994) Task Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), A form of supervision. learn to tag. Trigram Tagging Model JJ NNS MD VB JJ NNS red leaves don’t hide blue jays feature set: tag trigrams tag/word pairs from a POS dictionary CRF 99.5 ≈ log-linear supervised HMM EM 97.2 LENGTH 79.3 TRANS1 79.0 DELORTRANS1 78.8 10 × data DA Smith & Eisner (2004) EM Merialdo (1994) EM 66.6 62.1 DEL1WORD 60.4 DEL1SUBSEQUENCE random 70.0 58.7 35.1 • 96K words 30.0 40.0 50.0 60.0 70.0 80.0 90.0 • full POS dictionary tagging accuracy (ambiguous words) • uninformative initializer • best of 8 smoothing conditions 100.0 55.0 50.0 51.0 60.0 90.4 84.4 80.5 69.5 Dictionary excludes OOV words, which can get any tag. 60.5 56.6 65.0 What if we damage the POS dictionary? ■ all words ■ words from 1st half of corpus ■ words with count 2 ■ words with count 3 66.5 70.0 Dictionary includes ... 
[Bar chart: tagging accuracy (all words) under the dictionary-damage conditions above; 96K words, 17 coarse POS tags, uninformative initializer.]

Trigram Tagging Model + Spelling
JJ NNS MD VB JJ NNS
red leaves don’t hide blue jays
Feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes; contains-hyphen; contains-digit.

[Bar chart: tagging accuracy (all words) with spelling features, across the same dictionary-damage conditions.]
Log-linear spelling features aided recovery ... but only with a smart neighborhood.
The model need not be finite-state.

[Bar chart: attachment accuracy for unsupervised dependency parsing (Klein & Manning 2004 model): EM vs. the LENGTH and TRANS1 neighborhoods, under uninformative and clever initializers.]

To Sum Up ...
Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It’s a particularly good fit for log-linear models: unsupervised sequence models with max ent features, all in time for ACL 2006.
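As a closing illustration of “picking your own denominator”: a brute-force Python sketch of two of the neighborhoods from earlier slides (DEL1WORD and TRANS1), enumerated explicitly. The actual system encodes neighborhoods as finite-state lattices so that the sums over them stay tractable; this listing only shows what the sets contain.

```python
# Brute-force sketch of two contrastive neighborhoods, enumerated explicitly.
def del1word(words):
    """The sentence plus every way of deleting one word: n+1 neighbors."""
    yield tuple(words)
    for i in range(len(words)):
        yield tuple(words[:i] + words[i + 1:])

def trans1(words):
    """The sentence plus every transposition of an adjacent pair: n neighbors."""
    yield tuple(words)
    for i in range(len(words) - 1):
        yield tuple(words[:i] + [words[i + 1], words[i]] + words[i + 2:])

sentence = "red leaves don't hide blue jays".split()   # n = 6 words
print(len(set(del1word(sentence))))   # 7 = n + 1
print(len(set(trans1(sentence))))     # 6 = n
```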