CS 388: Natural Language Processing: Semantic Parsing
Raymond J. Mooney
University of Texas at Austin

Representing Meaning
• Representing the meaning of natural language is ultimately a difficult philosophical question, i.e. the "meaning of meaning".
• The traditional approach is to map ambiguous NL to unambiguous logic in first-order predicate calculus (FOPC).
• Standard inference (theorem-proving) methods exist for FOPC that can determine when one statement entails (implies) another. Questions can be answered by determining what potential responses are entailed by given NL statements and background knowledge, all encoded in FOPC.

Model-Theoretic Semantics
• The meaning of traditional logic is based on model-theoretic semantics, which defines meaning in terms of a model (a.k.a. possible world): a set-theoretic structure that defines a (potentially infinite) set of objects with properties and relations between them.
• A model is a connecting bridge between language and the world, representing the abstract objects and relations that exist in a possible world.
• An interpretation is a mapping from logic to the model that defines predicates extensionally, in terms of the set of tuples of objects that make them true (their denotation or extension).
  – The extension of Red(x) is the set of all red things in the world.
  – The extension of Father(x,y) is the set of all pairs of objects <A,B> such that A is B's father.

Truth-Conditional Semantics
• Model-theoretic semantics gives the truth conditions for a sentence, i.e. a model satisfies a logical sentence iff the sentence evaluates to true in the given model.
• The meaning of a sentence is therefore defined as the set of all possible worlds in which it is true.

Semantic Parsing
• Semantic parsing: transforming natural language (NL) sentences into completely formal logical forms or meaning representations (MRs).
• Sample application domains where MRs are directly executable by another computer system to perform some task:
  – CLang: RoboCup Coach Language
  – Geoquery: a database query application

CLang: RoboCup Coach Language
• In the RoboCup Coach competition, teams compete to coach simulated soccer players [http://www.robocup.org].
• The coaching instructions are given in a formal language called CLang [Chen et al., 2003].
• Example of semantic parsing into CLang:
  NL: If the ball is in our goal area then player 1 should intercept it.
  CLang: (bpos (goal-area our) (do our {1} intercept))

Geoquery: A Database Query Application
• Query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996].
• Example of semantic parsing into a database query:
  NL: Which rivers run through the states bordering Texas?
  Query: answer(traverse(next_to(stateid('texas'))))
  Answer: Arkansas, Canadian, Cimarron, Gila, Mississippi, Rio Grande, …

Procedural Semantics
• The meaning of a sentence is a formal representation of a procedure that performs some action that is an appropriate response.
  – Answering questions
  – Following commands
• In philosophy, the "late" Wittgenstein was known for the "meaning as use" view of semantics, in contrast to the model-theoretic view of the "early" Wittgenstein and other logicians.

Predicate Logic Query Language
• Most existing work on computational semantics is based on predicate logic.
  NL: What is the smallest state by area?
  MR: answer(x1,smallest(x2,(state(x1),area(x1,x2))))
  Here x1 is a logical variable that denotes "the smallest state by area".
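To make the model-theoretic reading of such a query concrete, here is a minimal Python sketch (not from the slides) that evaluates the logical form above against a toy model in which predicates are defined extensionally as sets of tuples; the state names and areas below are invented for illustration.

```python
# Toy model: predicate extensions as sets of tuples (invented data).
state = {("alaska",), ("texas",), ("rhode island",)}
area = {("alaska", 665384), ("texas", 268596), ("rhode island", 1545)}

def answer_smallest_state_by_area():
    # Evaluate answer(x1, smallest(x2, (state(x1), area(x1, x2)))):
    # among bindings (x1, x2) satisfying state(x1) and area(x1, x2),
    # return the x1 whose x2 (area) is smallest.
    candidates = [(x1, x2) for (x1,) in state for (s, x2) in area if s == x1]
    return min(candidates, key=lambda pair: pair[1])[0]

print(answer_smallest_state_by_area())  # rhode island
```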
Functional Query Language (FunQL)
• Transform a logical language into a functional, variable-free language (Kate et al., 2005).
  NL: What is the smallest state by area?
  Logic: answer(x1,smallest(x2,(state(x1),area(x1,x2))))
  FunQL: answer(smallest_one(area_1(state(all))))

Learning Semantic Parsers
• Manually programming robust semantic parsers is difficult due to the complexity of the task.
• Semantic parsers can instead be learned automatically from sentences paired with their logical forms.
[Diagram: NL/MR training examples feed a semantic-parser learner, which outputs a semantic parser that maps natural language to meaning representations.]

Engineering Motivation
• Most computational language-learning research strives for broad coverage while sacrificing depth.
  – "Scaling up by dumbing down"
• Realistic semantic parsing currently entails domain dependence.
• Domain-dependent natural-language interfaces have a large potential market.
• Learning makes developing specific applications more tractable.
• Training corpora can be easily developed by tagging existing corpora of formal statements with natural-language glosses.

Cognitive Science Motivation
• Most natural-language learning methods require supervised training data that is not available to a child.
  – General lack of negative feedback on grammar.
  – No POS-tagged or treebank data.
• Assuming a child can infer the likely meaning of an utterance from context, NL/MR pairs are more cognitively plausible training data.

Our Semantic-Parser Learners
• CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003)
  – Separates parser learning and semantic-lexicon learning.
  – Learns a deterministic parser using ILP techniques.
• COCKTAIL (Tang & Mooney, 2001)
  – Improved ILP algorithm for CHILL.
• SILT (Kate, Wong & Mooney, 2005)
  – Learns symbolic transformation rules for mapping directly from NL to LF.
• SCISSOR (Ge & Mooney, 2005)
  – Integrates semantic interpretation into Collins' statistical syntactic parser.
• WASP (Wong & Mooney, 2006)
  – Uses syntax-based statistical machine translation methods.
• KRISP (Kate & Mooney, 2006)
  – Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.

CHILL (Zelle & Mooney, 1992–96)
• Semantic-parser acquisition system using Inductive Logic Programming (ILP) to induce a parser written in Prolog.
• Starts with a deterministic parsing "shell" written in Prolog and learns to control the operators of this parser to produce the given I/O pairs.
• Requires a semantic lexicon, which for each word gives one or more possible meaning representations.
• The parser must disambiguate words, introduce proper semantic representations for each, and then put them together in the right way to produce a proper representation of the sentence.

CHILL Example
• U.S. geographical database
  – Sample training pair:
    NL: Cuál es el capital del estado con la población más grande? ("What is the capital of the state with the largest population?")
    MR: answer(C, (capital(S,C), largest(P, (state(S), population(S,P)))))
  – Sample semantic lexicon:
    cuál: answer(_,_)
    capital: capital(_,_)
    estado: state(_)
    más grande: largest(_,_)
    población: population(_,_)
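A minimal sketch of the kind of semantic lexicon CHILL consumes, written here in Python rather than Prolog; the dictionary below simply transcribes the sample lexicon above, with each word or phrase mapped to a list of candidate meaning fragments (a word could have several, which is exactly what the parser must disambiguate).

```python
# Semantic lexicon: each word (or phrase) maps to one or more candidate
# meaning-representation fragments, written here as strings.
lexicon = {
    "cuál": ["answer(_,_)"],
    "capital": ["capital(_,_)"],
    "estado": ["state(_)"],
    "más grande": ["largest(_,_)"],
    "población": ["population(_,_)"],
}

def candidate_meanings(phrase):
    # Look up the meaning fragments the parser may introduce for a phrase.
    return lexicon.get(phrase, [])

print(candidate_meanings("estado"))  # ['state(_)']
```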
WOLFIE (Thompson & Mooney, 1995–1999)
• Learns a semantic lexicon for CHILL from the same corpus of semantically annotated sentences.
• Determines hypotheses for word meanings by finding the largest isomorphic common subgraphs shared by the meanings of sentences in which the word appears.
• Uses a greedy-covering-style algorithm to learn a small lexicon sufficient to allow compositional construction of the correct representation from the words in a sentence.

WOLFIE + CHILL Semantic-Parser Acquisition
[Diagram: NL/MR training examples feed the WOLFIE lexicon learner; its semantic lexicon is passed to the CHILL parser learner, which outputs a semantic parser mapping natural language to meaning representations.]

Compositional Semantics
• An approach to semantic analysis based on building up an MR compositionally, following the syntactic structure of a sentence.
• Build the MR recursively, bottom-up, from the parse tree:
  BuildMR(parse-tree)
    If parse-tree is a terminal node (word), then
      return an atomic lexical meaning for the word.
    Else
      For each child subtree_i of parse-tree:
        create its MR by calling BuildMR(subtree_i)
      Return an MR for the overall parse-tree by properly combining the resulting MRs of its children.

Composing MRs from Parse Trees
[Figure: syntactic parse tree for "What is the capital of Ohio?" with an MR attached to each node. "Ohio" contributes stateid('ohio'), "of" contributes loc_2(), the PP gets loc_2(stateid('ohio')), "capital" contributes capital(), the inner NP gets capital(loc_2(stateid('ohio'))), and the root S gets answer(capital(loc_2(stateid('ohio')))).]

Disambiguation with Compositional Semantics
• The composition function that combines the MRs of the children of a node can return ⊥ if there is no sensible way to compose the children's meanings.
• One could compute all parse trees up front and then compute semantics for each, eliminating any that ever generate a ⊥ semantics for any constituent.
• More efficient method:
  – When filling the (CKY) chart of syntactic phrases, also compute all possible compositional semantics of each phrase as it is constructed, and make an entry for each.
  – If a given phrase only gives ⊥ semantics, remove this phrase from the table, thereby eliminating any parse that includes this meaningless phrase.

Composing MRs from Parse Trees
[Figure: the same parse tree for "What is the capital of Ohio?" where "Ohio" is misinterpreted as riverid('ohio'); the PP gets loc_2(riverid('ohio')), but this cannot sensibly compose with capital(), so the higher constituents receive ⊥ and this analysis is eliminated.]
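A runnable sketch of BuildMR under simplifying assumptions not in the slides: the parse tree is a nested tuple, lexical meanings come from a small hand-written lexicon, and composition is string substitution into a function template, with None standing in for ⊥ when no combination applies.

```python
# Hypothetical lexicon: word -> meaning fragment ("" marks semantically
# vacuous words; a trailing "()" marks a template with an open argument slot).
LEX = {"what": "answer()", "is": "", "the": "", "capital": "capital()",
       "of": "loc_2()", "ohio": "stateid('ohio')"}

def compose(f, arg):
    # Fill the argument slot of a template like "loc_2()"; return None
    # (our stand-in for ⊥) if the pieces do not fit. A real system would
    # also check semantic types here, so that e.g. capital() could reject
    # loc_2(riverid('ohio')).
    if f.endswith("()") and not arg.endswith("()"):
        return f[:-2] + "(" + arg + ")"
    return None

def build_mr(tree):
    if isinstance(tree, str):                  # terminal node (word)
        return LEX.get(tree.lower())
    mrs = [build_mr(child) for child in tree]  # recurse on the children
    if any(m is None for m in mrs):
        return None                            # propagate ⊥ upward
    mrs = [m for m in mrs if m]                # drop vacuous meanings
    while len(mrs) > 1:                        # combine right to left
        combined = compose(mrs[-2], mrs[-1])
        if combined is None:
            return None                        # ⊥: no sensible composition
        mrs[-2:] = [combined]
    return mrs[0] if mrs else ""

tree = ("what", ("is", (("the", "capital"), ("of", "ohio"))))
print(build_mr(tree))  # answer(capital(loc_2(stateid('ohio'))))
```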
SCISSOR: Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations

SCISSOR
• An integrated syntax-based approach.
  – Allows both syntax and semantics to be used simultaneously to build meaning representations.
• A statistical parser is used to generate a semantically augmented parse tree (SAPT).
• A SAPT is translated into a complete formal meaning representation (MR) using a meaning-composition process.
• Allows statistical modeling of semantic selectional constraints in application domains.
  – (AGENT pass) = PLAYER
[Figure: SAPT for "our player 2 has the ball" with nodes S-bowner, NP-player, VP-bowner, PRP$-team, NN-player, CD-unum, VB-bowner, NP-null, DT-null, NN-null; the resulting MR is bowner(player(our,2)).]

Semantic Composition Example
[Figure: composing the SAPT for "our player 2 has the ball" bottom-up. The semantic labels player(team,unum) and bowner(player) require arguments; "the" and "ball" are semantically vacuous (null). NP-player(our,2) is built by filling player(_,_) with its team and unum arguments, and S-bowner(player(our,2)) by filling bowner(_) with the player, giving the final MR bowner(player(our,2)).]

Overview of SCISSOR
[Diagram: in training, NL sentences annotated with SAPTs are given to the learner, producing an integrated semantic parser; in testing, the parser maps a sentence to a SAPT, and ComposeMR turns the SAPT into an MR.]

Extending Collins' (1997) Syntactic Parsing Model
• Collins (1997) introduced a lexicalized head-driven syntactic parsing model.
• Bikel (2004) provides an easily extended open-source version of the Collins statistical parser.
• SCISSOR extends the parsing model to generate semantic labels simultaneously with syntactic labels, constrained by semantic constraints in application domains.

Integrating Semantics into the Model
• Use the same Markov processes.
• Add a semantic label to each node.
• Add semantic subcat frames.
  – Give semantic subcategorization preferences.
  – E.g., bowner takes a player as its argument.
[Figure: the syntactic tree for "our player 2 has the ball" (S, NP, VP, NP with PRP$, NN, CD, VB, DT, NN) side by side with its semantically augmented version (S-bowner, NP-player, VP-bowner, NP-null, PRP$-team, NN-player, CD-unum, VB-bowner, DT-null, NN-null), both headed by "has".]

Adding Semantic Labels into the Model
For the SAPT of "our player 2 has the ball", the probability of expanding S-bowner(has) is built up factor by factor:
  Ph(VP-bowner | S-bowner, has)                                              (head)
  × Plc({NP}-{player} | S-bowner, VP-bowner, has)                            (left subcat frame)
  × Prc({}-{} | S-bowner, VP-bowner, has)                                    (right subcat frame)
  × Pd(NP-player(player) | S-bowner, VP-bowner, has, LEFT, {NP}-{player})    (left dependent)
  × Pd(STOP | S-bowner, VP-bowner, has, LEFT, {}-{})                         (stop on the left)
  × Pd(STOP | S-bowner, VP-bowner, has, RIGHT, {}-{})                        (stop on the right)
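A toy sketch of how the factored probability above would be evaluated; the conditional probability values here are invented placeholders, standing in for the frequency counts a trained model would estimate from annotated SAPTs.

```python
import math

# Invented conditional probabilities for the single expansion step above;
# a trained model estimates these by counting over annotated SAPTs.
factors = {
    ("Ph",  "VP-bowner | S-bowner, has"): 0.7,
    ("Plc", "{NP}-{player} | S-bowner, VP-bowner, has"): 0.6,
    ("Prc", "{}-{} | S-bowner, VP-bowner, has"): 0.8,
    ("Pd",  "NP-player(player) | ..., LEFT, {NP}-{player}"): 0.5,
    ("Pd",  "STOP | ..., LEFT, {}-{}"): 0.9,
    ("Pd",  "STOP | ..., RIGHT, {}-{}"): 0.9,
}

# The probability of the expansion is the product of its factors.
prob = math.prod(factors.values())
print(f"P(expansion) = {prob:.4f}")  # 0.7*0.6*0.8*0.5*0.9*0.9 = 0.1361
```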
SCISSOR Parser Implementation
• Supervised training on annotated SAPTs is just frequency counting.
• An augmented smoothing technique is employed to account for the additional data sparsity created by semantic labels.
• Parsing of test sentences to find the most probable SAPT is performed using a variant of the standard CKY chart-parsing algorithm.

Smoothing
• Each label in a SAPT is the combination of a syntactic label and a semantic label.
• This increases data sparsity.
• Use the chain rule to break the parameters down:
  Ph(H | P, w) = Ph(Hsyn, Hsem | P, w) = Ph(Hsyn | P, w) × Ph(Hsem | P, w, Hsyn)

Learning Semantic Parsers with a Formal Grammar for Meaning Representations
• Our other techniques assume that meaning-representation languages (MRLs) have deterministic context-free grammars.
  – True for almost all computer languages.
  – MRs can be parsed unambiguously.
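A small sketch of why a deterministic MRL grammar makes MR parsing trivial: the recursive-descent parser below (an illustration, not KRISP's actual code) recovers the unique production tree of a Geoquery-style MR.

```python
import re

def parse_mr(mr):
    # Tokenize a FunQL-style MR into names, quoted strings, and parens.
    tokens = re.findall(r"[A-Za-z_0-9]+|'[^']*'|[()]", mr)
    pos = 0
    def parse():
        nonlocal pos
        head = tokens[pos]; pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1                      # consume "("
            while tokens[pos] != ")":
                children.append(parse())
            pos += 1                      # consume ")"
        return (head, children)
    return parse()

tree = parse_mr("answer(traverse(next_to(stateid('texas'))))")
print(tree)
# ('answer', [('traverse', [('next_to', [('stateid', [("'texas'", [])])])])])
```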
Example:
  NL: Which rivers run through the states bordering Texas?
  MR: answer(traverse(next_to(stateid('texas'))))
[Figure: parse tree of the MR under the MRL grammar, with non-terminals ANSWER, RIVER, TRAVERSE, STATE, NEXT_TO, STATEID on the internal nodes and terminals answer, traverse, next_to, stateid, 'texas' at the leaves.]
Productions used: ANSWER → answer(RIVER), RIVER → TRAVERSE(STATE), STATE → NEXT_TO(STATE), TRAVERSE → traverse, NEXT_TO → next_to, STATEID → 'texas'

KRISP: Kernel-based Robust Interpretation for Semantic Parsing
• Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar.
• Productions of the MRL are treated like semantic concepts.
• An SVM classifier with a string-subsequence kernel is trained for each production to identify whether an NL substring represents that semantic concept.
• These classifiers are then used to compositionally build MRs of the sentences.

Overview of KRISP
[Diagram: in training, the MRL grammar and NL sentences with MRs are used to collect positive and negative examples, which train string-kernel-based SVM classifiers; in testing, the resulting semantic parser maps novel NL sentences to their best MRs.]

KRISP's Semantic Parsing
• We first define the semantic derivation of an NL sentence.
• We next define the probability of a semantic derivation.
• Semantic parsing of an NL sentence involves finding its most probable semantic derivation.
• It is straightforward to obtain the MR from a semantic derivation.

Semantic Derivation of an NL Sentence
• A semantic derivation is the MR parse tree with productions on the nodes, where each node additionally covers a substring of the NL sentence.
• Example, for "Which rivers run through the states bordering Texas?" (words numbered 1..9):
  (ANSWER → answer(RIVER), [1..9])
    (RIVER → TRAVERSE(STATE), [1..9])
      (TRAVERSE → traverse, [1..4])
      (STATE → NEXT_TO(STATE), [5..9])
        (NEXT_TO → next_to, [5..7])
        (STATE → STATEID, [8..9])
          (STATEID → 'texas', [8..9])
• Substrings in the NL sentence may occur in a different order than in the MR parse; nodes are allowed to permute the children productions of the original MR parse.
• Example, for "Through the states that border Texas which rivers run?" (words numbered 1..10):
  (ANSWER → answer(RIVER), [1..10])
    (RIVER → TRAVERSE(STATE), [1..10])
      (STATE → NEXT_TO(STATE), [1..6])
        (NEXT_TO → next_to, [1..5])
        (STATE → STATEID, [6..6])
          (STATEID → 'texas', [6..6])
      (TRAVERSE → traverse, [7..10])
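A minimal sketch (with invented data-structure names) of how the MR is read off a semantic derivation: each node carries a production, and the MR string is rebuilt by expanding each production's right-hand side with the MRs of its children, regardless of the order in which the children cover the sentence. The derivation below mirrors the permuted example above.

```python
# A derivation node: (LHS, RHS template, children); spans are omitted here
# since they do not affect MR reconstruction. With duplicate non-terminals
# a real implementation would track which slot each child fills.
def mr_of(node):
    lhs, rhs, children = node
    for child in children:
        rhs = rhs.replace(child[0], mr_of(child), 1)
    return rhs

derivation = (
    "ANSWER", "answer(RIVER)",
    [("RIVER", "TRAVERSE(STATE)",
      [("STATE", "NEXT_TO(STATE)",     # children listed in sentence order,
        [("NEXT_TO", "next_to", []),   # permuted relative to the MR parse
         ("STATE", "STATEID",
          [("STATEID", "stateid('texas')", [])])]),
       ("TRAVERSE", "traverse", [])])])

print(mr_of(derivation))  # answer(traverse(next_to(stateid('texas'))))
```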
Probability of a Semantic Derivation
• Let Pπ(s[i..j]) be the probability that production π covers the substring s[i..j] of sentence s.
• For example, P_{NEXT_TO → next_to}("the states bordering") = 0.99 means the classifier for NEXT_TO → next_to assigns probability 0.99 to the substring covering words 5..7.
• These probabilities are obtained from the string-kernel-based SVM classifiers trained for each production π.
• Assuming independence, the probability of a semantic derivation D is:
  P(D) = ∏(π,[i..j])∈D Pπ(s[i..j])
• Example, for "Which rivers run through the states bordering Texas?":
  (ANSWER → answer(RIVER), [1..9])       0.98
  (RIVER → TRAVERSE(STATE), [1..9])      0.9
  (TRAVERSE → traverse, [1..4])          0.95
  (STATE → NEXT_TO(STATE), [5..9])       0.89
  (NEXT_TO → next_to, [5..7])            0.99
  (STATE → STATEID, [8..9])              0.93
  (STATEID → 'texas', [8..9])            0.98
  P(D) = 0.98 × 0.9 × 0.95 × 0.89 × 0.99 × 0.93 × 0.98 ≈ 0.673

Computing the Most Probable Semantic Derivation
• The task of semantic parsing is to find the most probable semantic derivation of the NL sentence, given all the probabilities Pπ(s[i..j]).
• Implemented by extending Earley's [1970] context-free grammar parsing algorithm.
• Resembles PCFG parsing, but differs because:
  – The probability of a production depends on which substring of the sentence it covers.
  – Leaves are not terminals but substrings of words.
• Performs a greedy approximation search with beam width ω = 20, and returns the ω most probable derivations it finds.
• Uses a threshold θ = 0.05 to prune low-probability trees.
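A tiny sketch reproducing the derivation probability computed above; the per-node probabilities are the classifier outputs shown in the example.

```python
import math

# (production, span, Pπ(s[i..j])) for each node of the derivation,
# as given in the example above.
derivation = [
    ("ANSWER → answer(RIVER)",  (1, 9), 0.98),
    ("RIVER → TRAVERSE(STATE)", (1, 9), 0.90),
    ("TRAVERSE → traverse",     (1, 4), 0.95),
    ("STATE → NEXT_TO(STATE)",  (5, 9), 0.89),
    ("NEXT_TO → next_to",       (5, 7), 0.99),
    ("STATE → STATEID",         (8, 9), 0.93),
    ("STATEID → 'texas'",       (8, 9), 0.98),
]

# Under the independence assumption, P(D) is the product over all nodes.
p_d = math.prod(p for _, _, p in derivation)
print(f"P(D) = {p_d:.3f}")  # 0.673
```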
Overview of KRISP (refined)
[Diagram: as before, but training now iterates: the best semantic derivations (correct and incorrect) found with the current classifiers Pπ(s[i..j]) are fed back to collect new positive and negative examples.]

KRISP's Training Algorithm
• Takes NL sentences paired with their respective MRs as input.
• Obtains the MR parses.
• Induces the semantic parser using SVMs with a string-subsequence kernel, and refines it over iterations.
• In the first iteration, for every production π:
  – Call those sentences positive whose MR parses use that production.
  – Call the remaining sentences negative.

Support Vector Machines
• An approach based on extending a neural-network-style linear classifier like the perceptron.
• Finds the linear separator that maximizes the margin between the classes.
• Grounded in computational learning theory, which explains why max-margin is a good approach (Vapnik, 1995).
• Good at avoiding overfitting in high-dimensional feature spaces.
• Performs well on various text and language problems, which tend to be high-dimensional.

Picking a Linear Separator
[Figure: several alternative linear separators between two classes of points; which is best?]

Classification Margin
• Consider the distance of points from the separator.
• The examples closest to the hyperplane are the support vectors.
• The margin ρ of the separator is the width of separation between the classes.
[Figure: two point classes separated by a hyperplane, with the margin ρ and the distance r of an example from the separator marked.]

SVM Algorithms
• Finding the max-margin separator is an optimization problem called quadratic optimization.
• Algorithms that guarantee an optimal margin take at least O(n²) time and do not scale well to large data sets.
• Approximation algorithms like SVM-light (Joachims, 1999) and SMO (Platt, 1999) allow scaling to realistic problems.

Kernels
• SVMs can be extended to learn non-linear separators by using kernel functions.
• A kernel function is a similarity function between two instances, K(x1,x2), that must satisfy certain mathematical constraints.
• A kernel function implicitly maps instances into a higher-dimensional feature space where (hopefully) the categories are linearly separable.
• A kernel-based method (like an SVM) can use a kernel to implicitly operate in this higher-dimensional space without having to explicitly map instances into this much larger (perhaps infinite) space (the "kernel trick").
• Kernels can be defined on non-vector data like strings, trees, and graphs, allowing the application of kernel-based methods to complex, unbounded, non-vector data structures.

Non-linear SVMs: Feature Spaces
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

String Subsequence Kernel
• Define the kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002].
• Example:
  s = "states that are next to"
  t = "the states next to"
  The common subsequences are: states; next; to; states next; states to; next to; states next to.
  So K(s,t) = 7.
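A brute-force sketch of this word-subsequence count, reproducing K(s,t) = 7 for the example above; it enumerates all subsequences explicitly, whereas Lodhi et al. (2002) give an efficient dynamic program with gap penalties.

```python
from itertools import combinations

def subsequences(words):
    # All non-empty word subsequences (gaps allowed, order preserved).
    return {tuple(words[i] for i in idx)
            for r in range(1, len(words) + 1)
            for idx in combinations(range(len(words)), r)}

def subseq_kernel(s, t):
    # Count of common word subsequences (exponential-time sketch).
    return len(subsequences(s.split()) & subsequences(t.split()))

print(subseq_kernel("states that are next to", "the states next to"))  # 7
```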
KRISP's Training Algorithm contd.
First iteration, for the production STATE → NEXT_TO(STATE):
  Positives:
  • which rivers run through the states bordering texas ?
  • what is the most populated state bordering oklahoma ?
  • what is the largest city in states that border california ?
  • …
  Negatives:
  • what state has the highest population ?
  • which states have cities named austin ?
  • what states does the delaware river run through ?
  • what is the lowest point of the state with the largest area ?
  • …
These examples train a string-kernel-based SVM classifier for the production, yielding P_{STATE → NEXT_TO(STATE)}(s[i..j]).

String Subsequence Kernel contd.
• The examples are implicitly mapped to the feature space of all subsequences, and the kernel computes the dot products there.
[Figure: phrases such as "the states next to", "states that border", "states bordering", and "states that share border" cluster together in the implicit subsequence feature space, apart from phrases like "state with the capital of", "states with area larger than", and "states through which".]

Support Vector Machines contd.
• The SVM finds a separating hyperplane in this feature space such that the margin is maximized.
[Figure: the bordering-related phrases above fall on one side of the separating hyperplane and the other phrases on the opposite side; a new phrase such as "the states next to" falls on the positive side with probability estimate 0.97.]
• A probability estimate of an example belonging to a class can be obtained from its distance to the hyperplane [Platt, 1999].

KRISP's Training Algorithm contd.
• Using these classifiers Pπ(s[i..j]), obtain the ω best semantic derivations of each training sentence.
• Some of these derivations give the correct MR (correct derivations); others give incorrect MRs (incorrect derivations).
• For the next iteration, collect positives from the most probable correct derivation.
  – The extended Earley's algorithm can be forced to produce only correct derivations by making sure every subtree it generates exists in the correct MR parse.
• Collect negatives from incorrect derivations that have higher probability than the most probable correct derivation.

Collecting Positive Examples
The most probable correct derivation of "Which rivers run through the states bordering Texas?"; positives are collected from the substring covered by each node:
  (ANSWER → answer(RIVER), [1..9])
    (RIVER → TRAVERSE(STATE), [1..9])
      (TRAVERSE → traverse, [1..4])
      (STATE → NEXT_TO(STATE), [5..9])
        (NEXT_TO → next_to, [5..7])
        (STATE → STATEID, [8..9])
          (STATEID → 'texas', [8..9])
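A sketch of how one production's classifier could be trained with an off-the-shelf SVM; this uses scikit-learn's precomputed-kernel interface and Platt-scaled probabilities as a stand-in for KRISP's own SVM machinery, with the toy subsequence kernel from the earlier sketch and example phrases drawn from the lists above.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def subseq_kernel(s, t):
    # Common word-subsequence count (same toy kernel as sketched earlier;
    # a real system would normalize and penalize gaps).
    def subs(text):
        ws = text.split()
        return {tuple(ws[i] for i in c) for r in range(1, len(ws) + 1)
                for c in combinations(range(len(ws)), r)}
    return len(subs(s) & subs(t))

# Phrases labeled for the production STATE → NEXT_TO(STATE) (toy data).
phrases = ["the states bordering", "states that border", "state bordering",
           "states next to", "the highest population", "cities named austin",
           "the delaware river", "lowest point of the state"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Gram matrix of pairwise kernel values between training phrases.
gram = np.array([[subseq_kernel(a, b) for b in phrases] for a in phrases])

# probability=True enables Platt scaling: P(class | distance to hyperplane).
clf = SVC(kernel="precomputed", probability=True, random_state=0)
clf.fit(gram, labels)

test = ["the states next to"]
k_test = np.array([[subseq_kernel(t, b) for b in phrases] for t in test])
print(clf.predict_proba(k_test)[0][1])  # P(phrase fits the production)
```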
Collecting Negative Examples
An incorrect derivation with probability greater than that of the most probable correct derivation; its MR is answer(traverse(stateid('texas'))), which is wrong:
  (ANSWER → answer(RIVER), [1..9])
    (RIVER → TRAVERSE(STATE), [1..9])
      (TRAVERSE → traverse, [1..7])
      (STATE → STATEID, [8..9])
        (STATEID → 'texas', [8..9])
Negative examples are collected by comparing this derivation with the most probable correct one:
• Traverse both trees in breadth-first order until the first nodes where their productions (or spans) differ are found: the correct derivation has (TRAVERSE → traverse, [1..4]) and (STATE → NEXT_TO(STATE), [5..9]), while the incorrect one has (TRAVERSE → traverse, [1..7]).
• Mark the words under these differing nodes (words 1..7, "which rivers run through the states bordering").
• Consider all the productions covering the marked words; collect negatives for productions that cover any marked word in the incorrect derivation but not in the correct derivation. Here, TRAVERSE → traverse covering words 1..7 yields a negative example. A sketch of this comparison follows below.
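A sketch (with invented helper names) of the comparison step: walk the two derivations in parallel breadth-first order until productions or spans differ, then report the spans whose words get marked.

```python
from collections import deque

# A derivation node: (production string, (start, end), children).
def differing_spans(correct, incorrect):
    # Parallel breadth-first traversal; return the spans of the first
    # nodes whose productions or covered spans differ.
    queue = deque([(correct, incorrect)])
    while queue:
        c, i = queue.popleft()
        if c[0] != i[0] or c[1] != i[1]:
            return [c[1], i[1]]        # spans whose words get marked
        # Pair up children positionally (a sketch; real derivations may
        # differ in arity once a difference appears lower down).
        queue.extend(zip(c[2], i[2]))
    return []

correct = ("ANSWER → answer(RIVER)", (1, 9),
           [("RIVER → TRAVERSE(STATE)", (1, 9),
             [("TRAVERSE → traverse", (1, 4), []),
              ("STATE → NEXT_TO(STATE)", (5, 9), [])])])
incorrect = ("ANSWER → answer(RIVER)", (1, 9),
             [("RIVER → TRAVERSE(STATE)", (1, 9),
               [("TRAVERSE → traverse", (1, 7), []),
                ("STATE → STATEID", (8, 9), [])])])

print(differing_spans(correct, incorrect))  # [(1, 4), (1, 7)]
```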
KRISP's Training Algorithm contd.
Next iteration: more refined positive and negative examples for STATE → NEXT_TO(STATE):
  Positives:
  • the states bordering texas
  • state bordering oklahoma
  • states that border california
  • states which share border
  • …
  Negatives:
  • what state has the highest population ?
  • what states does the delaware river run through ?
  • which states have cities named austin ?
  • what is the lowest point of the state with the largest area ?
  • next to state of iowa
  • which rivers run through states bordering
  • …
These retrain the string-kernel-based SVM classifier, yielding a refined P_{STATE → NEXT_TO(STATE)}(s[i..j]), and the process repeats.

WASP: A Machine Translation Approach to Semantic Parsing
• Uses statistical machine translation techniques:
  – Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005)
  – Word alignments (Brown et al., 1993; Och & Ney, 2003)
• Hence the name: Word Alignment-based Semantic Parsing.

A Unifying Framework for Parsing and Generation
[Diagram: natural languages and formal languages as two domains. Machine translation maps between natural languages; semantic parsing maps from natural to formal languages; tactical generation maps from formal to natural languages; and compiling (Aho & Ullman, 1972) maps between formal languages. Synchronous parsing covers all of these.]

Synchronous Context-Free Grammars (SCFG)
• Developed by Aho & Ullman (1972) as a theory of compilers that combines syntax analysis and code generation in a single phase.
• An SCFG generates a pair of strings in a single derivation.

Context-Free Semantic Grammar
An NL-side derivation for "What is the capital of Ohio":
  QUERY → What is CITY
  CITY → the capital CITY
  CITY → of STATE
  STATE → Ohio

Productions of Synchronous Context-Free Grammars
Each production pairs a natural-language string with a formal-language string:
  QUERY → What is CITY / answer(CITY)

Synchronous Context-Free Grammar Derivation
The paired rules derive an NL string and an MR string in parallel:
  QUERY → What is CITY / answer(CITY)
  CITY → the capital CITY / capital(CITY)
  CITY → of STATE / loc_2(STATE)
  STATE → Ohio / stateid('ohio')
yielding "What is the capital of Ohio" paired with answer(capital(loc_2(stateid('ohio')))).

Probabilistic Parsing Model
Two competing derivations for "capital of Ohio":
  Derivation d1 (MR: capital(loc_2(stateid('ohio')))):
    CITY → capital CITY / capital(CITY)      λ = 0.5
    CITY → of STATE / loc_2(STATE)           λ = 0.3
    STATE → Ohio / stateid('ohio')           λ = 0.5
  Pr(d1 | capital of Ohio) = exp(0.5 + 0.3 + 0.5) / Z = exp(1.3) / Z
  Derivation d2 (MR: capital(loc_2(riverid('ohio')))):
    CITY → capital CITY / capital(CITY)      λ = 0.5
    CITY → of RIVER / loc_2(RIVER)           λ = 0.05
    RIVER → Ohio / riverid('ohio')           λ = 0.5
  Pr(d2 | capital of Ohio) = exp(0.5 + 0.05 + 0.5) / Z = exp(1.05) / Z
  where Z is a normalization constant.
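A sketch of the synchronous derivation mechanism above: each rule expands a non-terminal on the NL side and the MR side in one step. The grammar is restricted to one rule per non-terminal for simplicity, and the second CITY is renamed PP here only to keep string substitution unambiguous in this sketch; a real grammar has many weighted rules per non-terminal.

```python
# A toy SCFG: non-terminal -> (NL template, MR template).
RULES = {
    "QUERY": ("What is CITY", "answer(CITY)"),
    "CITY":  ("the capital PP", "capital(PP)"),
    "PP":    ("of STATE", "loc_2(STATE)"),
    "STATE": ("Ohio", "stateid('ohio')"),
}

def derive(symbol):
    # Expand a non-terminal synchronously on both sides.
    nl, mr = RULES[symbol]
    for child in RULES:
        if child in nl:
            child_nl, child_mr = derive(child)
            nl = nl.replace(child, child_nl)
            mr = mr.replace(child, child_mr)
    return nl, mr

nl, mr = derive("QUERY")
print(nl)  # What is the capital of Ohio
print(mr)  # answer(capital(loc_2(stateid('ohio'))))
```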
Overview of WASP
[Diagram: training takes an unambiguous CFG of the MRL and a training set {(e,f)}; lexical acquisition produces a lexicon L, and parameter estimation produces a parsing model parameterized by λ; in testing, semantic parsing maps an input sentence e' to an output MR f'.]

Lexical Acquisition
• Transformation rules are extracted from word alignments between an NL sentence, e, and its correct MR, f, for each training example (e, f).

Word Alignments
[Figure: a word alignment between the French sentence "Le programme a été mis en application" and the English sentence "And the program has been implemented".]
• A word alignment is a mapping from the words of one string to their counterparts in another, e.g. from French words to their meanings expressed in English.

Lexical Acquisition contd.
• Train a statistical word-alignment model (IBM Model 5) on the training set.
• Obtain the most probable n-to-1 word alignments for each training example.
• Extract transformation rules from these word alignments.
• The lexicon L consists of all extracted transformation rules.

Word Alignment for Semantic Parsing
• Align NL words with MR tokens, e.g. "The goalie should always stay in our half" with ((true) (do our {1} (pos (half our)))).
• Problem: how should syntactic tokens such as parentheses be handled?

Use of the MRL Grammar
• Instead of aligning to raw MR tokens, align NL words (n-to-1) to the productions of the top-down, left-most derivation of the MR under the unambiguous MRL CFG:
  RULE → (CONDITION DIRECTIVE)
  CONDITION → (true)
  DIRECTIVE → (do TEAM {UNUM} ACTION)
  TEAM → our
  UNUM → 1
  ACTION → (pos REGION)
  REGION → (half TEAM)
  TEAM → our

Extracting Transformation Rules
• Working bottom-up through the alignment, each aligned production yields a transformation rule pairing an NL pattern with an MR pattern, e.g.:
  TEAM → our / our
  REGION → TEAM half / (half TEAM)
  ACTION → stay in REGION / (pos REGION)

Probabilistic Parsing Model
• Based on a maximum-entropy model:
  Prλ(d | e) = (1/Zλ(e)) exp Σi λi fi(d)
• The features fi(d) are the number of times each transformation rule is used in a derivation d.
• The output translation is the yield of the most probable derivation:
  f* = m(argmaxd Prλ(d | e)), where m(d) is the MR yield of derivation d.

Parameter Estimation
• Maximum conditional log-likelihood criterion:
  λ* = argmaxλ Σ(e,f) log Prλ(f | e)
• Since correct derivations are not included in the training data, the parameters λ* are learned in an unsupervised manner.
• EM algorithm combined with improved iterative scaling, where the hidden variables are the correct derivations (Riezler et al., 2000).
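A small sketch of the maximum-entropy scoring above, reusing the rule weights λ from the d1/d2 example earlier; the feature vector counts rule uses, and the two derivations' probabilities come out proportional to exp(1.3) and exp(1.05).

```python
import math

# Rule weights λ from the earlier d1/d2 example.
weights = {
    "CITY → capital CITY / capital(CITY)": 0.5,
    "CITY → of STATE / loc_2(STATE)": 0.3,
    "STATE → Ohio / stateid('ohio')": 0.5,
    "CITY → of RIVER / loc_2(RIVER)": 0.05,
    "RIVER → Ohio / riverid('ohio')": 0.5,
}

d1 = ["CITY → capital CITY / capital(CITY)",
      "CITY → of STATE / loc_2(STATE)",
      "STATE → Ohio / stateid('ohio')"]
d2 = ["CITY → capital CITY / capital(CITY)",
      "CITY → of RIVER / loc_2(RIVER)",
      "RIVER → Ohio / riverid('ohio')"]

def score(derivation):
    # Σi λi fi(d): each rule use contributes its weight.
    return sum(weights[rule] for rule in derivation)

z = math.exp(score(d1)) + math.exp(score(d2))  # normalize over derivations
for name, d in [("d1", d1), ("d2", d2)]:
    print(name, math.exp(score(d)) / z)        # Pr(d | "capital of Ohio")
# d1 ≈ 0.562, d2 ≈ 0.438
```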
Experimental Corpora
• CLang
  – 300 randomly selected pieces of coaching advice from the log files of the 2003 RoboCup Coach Competition
  – 22.52 words on average in NL sentences
  – 14.24 tokens on average in formal expressions
• GeoQuery [Zelle & Mooney, 1996]
  – 250 queries for the given U.S. geography database
  – 6.87 words on average in NL sentences
  – 5.32 tokens on average in formal expressions
  – Also translated into Spanish, Turkish, and Japanese.

Experimental Methodology
• Evaluated using standard 10-fold cross-validation.
• Correctness:
  – CLang: the output exactly matches the correct representation.
  – Geoquery: the resulting query retrieves the same answer as the correct representation.
• Metrics:
  Precision = |Correct Completed Parses| / |Completed Parses|
  Recall = |Correct Completed Parses| / |Sentences|

[Figures: precision and recall learning curves for CLang; precision and recall learning curves for GeoQuery; precision and recall learning curves for GeoQuery with WASP.]

λWASP
• Logical forms can be made more isomorphic to NL sentences than FunQL, allowing for better compositionality and generalization.
• λWASP is a version of WASP that uses the λ calculus to introduce and bind logical variables.
  – Standard in compositional formal semantics, e.g. Montague semantics.
• The SCFG is modified to a λ-SCFG.

SCFG Derivations and Their Limitation
A plain SCFG derivation of "What is the smallest state by area" would use paired rules such as:
  QUERY → What is FORM / answer(x1, FORM)
  FORM → the smallest FORM FORM / smallest(x2,(FORM, FORM))
  FORM → state / state(x1)
  FORM → by area / area(x1,x2)
This can yield the desired MR answer(x1,smallest(x2,(state(x1),area(x1,x2)))), but the grammar has no way to guarantee that the logical variables x1 and x2 introduced by different rules are bound consistently.

λ-SCFG Derivations
λ-SCFG rules attach λ-operators that bind the variables and argument lists that pass them down:
  QUERY → What is FORM / answer(x1, FORM(x1))
  FORM → the smallest FORM FORM / λx1.smallest(x2,(FORM(x1), FORM(x1,x2)))
  FORM → state / λx1.state(x1)
  FORM → by area / λx1.λx2.area(x1,x2)

λ-SCFG Production Rules
  NL string: FORM → the smallest FORM FORM
  MR string: λx1.smallest(x2,(FORM(x1), FORM(x1,x2)))
• The variable-binding λ-operator binds occurrences of x1 in the MR string.
• The argument lists are used for function applications.
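A sketch of how these λ-functions make the composition work, using Python closures in place of λ-calculus terms, so that β-reduction becomes ordinary function application; variable naming is handled by hand here rather than by a general renaming procedure.

```python
# Each rule's MR side becomes a function of its variable arguments.
state = lambda x1: f"state({x1})"             # λx1.state(x1)
area  = lambda x1, x2: f"area({x1},{x2})"     # λx1.λx2.area(x1,x2)

def smallest(f, g):
    # λx1.smallest(x2,(FORM(x1), FORM(x1,x2)))
    return lambda x1: f"smallest(x2,({f(x1)},{g(x1, 'x2')}))"

def answer(f):
    # answer(x1, FORM(x1))
    return f"answer(x1,{f('x1')})"

print(answer(smallest(state, area)))
# answer(x1,smallest(x2,(state(x1),area(x1,x2))))
```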
Yield of λ-SCFG Derivations: Computing the Yield with Lambda Calculus
The MR yield of a λ-SCFG derivation is computed bottom-up by function application (β-reduction). For "What is the smallest state by area", with the middle rule's variables renamed for illustration:
  answer(x1, [λx2.smallest(x1,( (λx1.state(x1))(x2), (λx1.λx2.area(x1,x2))(x2,x1) ))](x1) )
• Function application replaces bound occurrences of the variable with the argument: (λx1.f(x1))(x2) = f(x2).
  – The inner applications give λx2.smallest(x1,( state(x2), area(x2,x1) )).
  – Applying this λ-function to x1, with its bound x1 renamed to x3 to avoid capture, gives the final yield.
• The result is a logical form free of λ-operators, with the logical variables properly named:
  answer(x1,smallest(x3,(state(x1),area(x1,x3))))

Learning in λWASP
• The induction of SCFG rules must be updated to introduce λ-functions, producing a λ-SCFG.

[Figure: λWASP results on Geoquery.]

Tactical Natural Language Generation
• Mapping a formal MR into NL.
• Can be done using statistical machine translation.
  – Previous work focuses on using generation in interlingual MT (Hajič et al., 2004).
  – There has been little, if any, research on exploiting statistical MT methods for generation.

Tactical Generation
• Can be seen as the inverse of semantic parsing:
  NL: The goalie should always stay in our half
  (semantic parsing maps downward; tactical generation maps upward)
  MR: ((true) (do our {1} (pos (half our))))

Generation by Inverting WASP
• The same synchronous grammar is used for both generation and semantic parsing: for a rule like QUERY → What is CITY / answer(CITY), semantic parsing treats the NL side as input and the MR side as output, while tactical generation does the reverse.
• The same procedure is used for lexical acquisition.
• The chart generator is very similar to the chart parser, but treats the MRL as input.
• Uses a log-linear probabilistic model inspired by Pharaoh (Koehn et al., 2003), a phrase-based MT system.
• Uses a bigram language model for the target NL.
• The resulting system is called WASP⁻¹.

[Figures: NIST-score generation results on Geoquery (English) and on RoboCup (English; contiguous phrases only). Human evaluation gives similar results in terms of fluency and adequacy.]

Conclusions
• Semantic parsing maps NL sentences to completely formal MRs.
• Semantic parsers can be effectively learned from supervised corpora consisting of only sentences paired with their formal MRs (and possibly also SAPTs).
• Learning methods can be based on:
  – Adding semantics to an existing statistical syntactic parser and then using compositional semantics.
  – Using SVMs with string kernels to recognize concepts in the NL and then composing them into a complete MR using the MRL grammar.
  – Using probabilistic synchronous context-free grammars to learn an NL/MR grammar that supports both semantic parsing and generation.