Learning for Semantic Parsing of Natural Language
Raymond J. Mooney
with Ruifang Ge, Rohit Kate, Yuk Wah Wong, John Zelle, and Cynthia Thompson
Machine Learning Group, Department of Computer Sciences, University of Texas at Austin
December 19, 2005

Syntactic Natural Language Learning
• Most computational research in natural-language learning has addressed "low-level" syntactic processing.
  – Morphology (e.g., past-tense generation)
  – Part-of-speech tagging
  – Shallow syntactic parsing (chunking)
  – Syntactic parsing

Semantic Natural Language Learning
• Learning for semantic analysis has been restricted to relatively "shallow" meaning representations.
  – Word sense disambiguation (e.g., SENSEVAL)
  – Semantic role assignment (determining agent, patient, instrument, etc.; e.g., FrameNet, PropBank)
  – Information extraction

Semantic Parsing
• A semantic parser maps a natural-language sentence to a complete, detailed semantic representation: a logical form or meaning representation (MR).
• For many applications, the desired output is immediately executable by another program.
• Two application domains:
  – CLang: RoboCup Coach Language
  – GeoQuery: a database query application

CLang: RoboCup Coach Language
• In the RoboCup Coach competition, teams compete to coach simulated soccer players.
• Coaching instructions are given in a formal language called CLang.
• Example:
  Coach (NL): "If the ball is in our penalty area, then all our players except player 4 should stay in our half."
  CLang: ((bpos (penalty-area our)) (do (player-except our {4}) (pos (half our))))

GeoQuery: A Database Query Application
• Query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996].
• Example:
  User (NL): "How many cities are there in the US?"
  Query: answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))), A))

Learning Semantic Parsers
• Manually programming robust semantic parsers is difficult due to the complexity of the task.
• Semantic parsers can instead be learned automatically from sentences paired with their logical forms.
[Diagram: NL/LF training examples are fed to a semantic-parser learner, which produces a semantic parser mapping natural language to logical form.]

Engineering Motivation
• Most computational language-learning research strives for broad coverage while sacrificing depth.
  – "Scaling up by dumbing down"
• Realistic semantic parsing currently entails domain dependence.
• Domain-dependent natural-language interfaces have a large potential market.
• Learning makes developing specific applications more tractable.
• Training corpora can be easily developed by tagging existing corpora of formal statements with natural-language glosses.

Cognitive Science Motivation
• Most natural-language learning methods require supervised training data that is not available to a child.
  – General lack of negative feedback on grammar.
  – No POS-tagged or treebank data.
• Assuming a child can infer the likely meaning of an utterance from context, NL/LF pairs are more cognitively plausible training data.

Our Semantic-Parser Learners
• CHILL + WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003)
  – Separates parser learning from semantic-lexicon learning.
  – Learns a deterministic parser using ILP techniques.
• COCKTAIL (Tang & Mooney, 2001)
  – Improved ILP algorithm for CHILL.
• SILT (Kate, Wong & Mooney, 2005)
  – Learns symbolic transformation rules that map directly from NL to LF.
• SCISSOR (Ge & Mooney, 2005)
  – Integrates semantic interpretation into Collins' statistical syntactic parser.
• WASP (Wong & Mooney, in preparation)
  – Uses syntax-based statistical machine translation methods.
• KRISP (Kate & Mooney, in preparation)
  – Uses a series of SVM classifiers with a string kernel to iteratively build semantic representations.
SCISSOR: Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations
• Based on a fairly standard approach to compositional semantics [Jurafsky and Martin, 2000].
• A statistical parser is used to generate a semantically augmented parse tree (SAPT).
  – Augments Collins' head-driven model 2 to incorporate semantic labels.
• A complete formal meaning representation (MR) is then composed from the SAPT.
• Example SAPT for "our player 2 has the ball":
  [Figure: S-bowner dominates NP-player (PRP$-team "our", NN-player "player", CD-unum "2") and VP-bowner (VB-bowner "has", NP-null (DT-null "the", NN-null "ball")).]
  MR: bowner(player(our, 2))

Overview of SCISSOR
[Diagram: Training — NL sentences paired with SAPTs are given to the learner, producing an integrated semantic parser. Testing — the parser maps an NL sentence to a SAPT, and ComposeMR derives the MR from the SAPT.]

SCISSOR SAPT Parser Implementation
• Semantic labels are added to Bikel's (2004) open-source version of the Collins statistical parser.
• The head-driven derivation of production rules is augmented to also generate semantic labels.
• Parameter estimation during training employs an augmented smoothing technique to account for the additional data sparsity created by semantic labels.
• Parsing of test sentences to find the most probable SAPT uses a standard beam-search-constrained version of the CKY chart-parsing algorithm.

ComposeMR
• Meaning representations are composed bottom-up over the SAPT, for example for "our player 2 has the ball":
  – Leaf semantic labels supply constants and predicates: team ("our"), unum ("2"), bowner ("has"); "the ball" contributes null.
  – Semantic labels are then replaced by MR templates with open argument slots: player(_, _) and bowner(_).
  – Filling the slots with the children's MRs yields player(our, 2) and finally bowner(player(our, 2)).
• Composition is driven by the MRL templates player(team, unum) and bowner(player).
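To make the composition step concrete, here is a minimal Python sketch of this kind of bottom-up MR composition over a SAPT. It is illustrative only, not SCISSOR's implementation: the tree encoding, the template table, and the small constant lexicon are assumptions chosen to reproduce just this one example.

```python
# Minimal sketch of ComposeMR-style bottom-up composition (illustrative only).
TEMPLATES = {"player": ("team", "unum"),   # player(team, unum)
             "bowner": ("player",)}        # bowner(player)
CONSTANTS = {"our": "our", "2": "2"}       # words that denote MR constants

def compose(label, children):
    """Return the MR string for a SAPT node, or None for 'null' material."""
    if label == "null":
        return None
    filled = {}                                   # semantic type -> child MR
    for child in children:
        if isinstance(child, str):                # a word at a leaf
            if child in CONSTANTS:
                filled[label] = CONSTANTS[child]
        else:
            child_mr = compose(*child)
            if child_mr is not None:
                filled[child[0]] = child_mr
    if label in TEMPLATES:                        # predicate node: fill its slots
        slots = TEMPLATES[label]
        if all(t in filled for t in slots):
            return f"{label}({','.join(filled[t] for t in slots)})"
        return next(iter(filled.values()), None)  # pass a lone child's MR upward
    return filled.get(label)                      # constant node (team, unum)

# SAPT for "our player 2 has the ball", keeping only the semantic labels:
sapt = ("bowner",
        [("player", [("team", ["our"]), ("player", ["player"]), ("unum", ["2"])]),
         ("bowner", [("bowner", ["has"]),
                     ("null", [("null", ["the"]), ("null", ["ball"])])])])
print(compose(*sapt))   # -> bowner(player(our,2))
```

The sketch fills each predicate's argument slots with the MRs of children whose semantic types match, which is the essence of the composition shown on the ComposeMR slides.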
WASP: A Machine Translation Approach to Semantic Parsing
• Based on a semantic grammar of the natural language.
• Uses statistical machine translation techniques:
  – Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005)
  – Word alignments (Brown et al., 1993; Och & Ney, 2003)
• Hence the name: Word Alignment-based Semantic Parsing.

Synchronous Context-Free Grammars (SCFG)
• Developed by Aho & Ullman (1972) as a theory of compilers that combines syntax analysis and code generation in a single phase.
• An SCFG generates a pair of strings in a single derivation.

Compiling, Machine Translation, and Semantic Parsing
• SCFG: formal language to formal language (compiling)
• Alignment models: natural language to natural language (machine translation)
• WASP: natural language to formal language (semantic parsing)

Context-Free Semantic Grammar
• Example derivation of "What is the capital of Ohio":
  QUERY → What is CITY
  CITY → the capital CITY
  CITY → of STATE
  STATE → Ohio

Productions of Synchronous Context-Free Grammars
• Each production pairs an NL pattern with an MRL template:
  QUERY → What is CITY / answer(CITY)
• Referred to as transformation rules in Kate, Wong & Mooney (2005).

Synchronous Context-Free Grammars
• A single synchronous derivation simultaneously yields an NL sentence and its MR:
  QUERY → What is CITY / answer(CITY)
  CITY → the capital CITY / capital(CITY)
  CITY → of STATE / loc_2(STATE)
  STATE → Ohio / stateid('ohio')
• NL yield: "What is the capital of Ohio"
  MR yield: answer(capital(loc_2(stateid('ohio'))))

Parsing Model of WASP
• N (non-terminals) = {QUERY, CITY, STATE, …}
• S (start symbol) = QUERY
• Tm (MRL terminals) = {answer, capital, loc_2, (, ), …}
• Tn (NL words) = {What, is, the, capital, of, Ohio, …}
• L (lexicon) =
  QUERY → What is CITY / answer(CITY)
  CITY → the capital CITY / capital(CITY)
  CITY → of STATE / loc_2(STATE)
  STATE → Ohio / stateid('ohio')
• λ (parameters of the probabilistic model) = ?
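Using the four lexicon rules just listed, the synchronous derivation can be mimicked with a few lines of Python. This is a toy sketch under assumed encodings (rules as pattern/template string pairs, rewriting the leftmost occurrence of a nonterminal on both sides); WASP's actual parser builds derivations chart-style rather than by string rewriting.

```python
# Toy sketch of a synchronous derivation with the lexicon rules above.
rules = [   # (nonterminal, NL pattern, MR template)
    ("QUERY", "What is CITY",     "answer(CITY)"),
    ("CITY",  "the capital CITY", "capital(CITY)"),
    ("CITY",  "of STATE",         "loc_2(STATE)"),
    ("STATE", "Ohio",             "stateid('ohio')"),
]

nl, mr = "QUERY", "QUERY"
for nonterminal, pattern, template in rules:
    # Rewrite the nonterminal simultaneously on the NL side and the MR side.
    nl = nl.replace(nonterminal, pattern, 1)
    mr = mr.replace(nonterminal, template, 1)

print(nl)   # What is the capital of Ohio
print(mr)   # answer(capital(loc_2(stateid('ohio'))))
```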
Probabilistic Parsing Model
• Derivation d1 of "capital of Ohio" yields capital(loc_2(stateid('ohio'))) using:
  CITY → capital CITY / capital(CITY)
  CITY → of STATE / loc_2(STATE)
  STATE → Ohio / stateid('ohio')
• Derivation d2 of the same phrase yields capital(loc_2(riverid('ohio'))) using:
  CITY → capital CITY / capital(CITY)
  CITY → of RIVER / loc_2(RIVER)
  RIVER → Ohio / riverid('ohio')
• With rule weights λ:
  d1: 0.5 + 0.3 + 0.5 = 1.3, so Pr(d1 | "capital of Ohio") = exp(1.3) / Z
  d2: 0.5 + 0.05 + 0.5 = 1.05, so Pr(d2 | "capital of Ohio") = exp(1.05) / Z
  where Z is the normalization constant.

Parsing Model of WASP (revisited)
• Same components as before: N (non-terminals), S (start symbol), Tm (MRL terminals), Tn (NL words), and the lexicon L of transformation rules.
• The parameters λ of the probabilistic model are now learned from the training data.
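A small numeric sketch of the log-linear scoring of d1 and d2 above, using the toy rule weights from the slide. For illustration, the normalization constant Z is computed over only these two derivations; in the full model it sums over all derivations of the sentence.

```python
import math

# Log-linear derivation model: Pr(d|e) = exp(sum of rule weights in d) / Z.
weights = {
    "CITY -> capital CITY / capital(CITY)": 0.5,
    "CITY -> of STATE / loc_2(STATE)":      0.3,
    "STATE -> Ohio / stateid('ohio')":      0.5,
    "CITY -> of RIVER / loc_2(RIVER)":      0.05,
    "RIVER -> Ohio / riverid('ohio')":      0.5,
}
d1 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of STATE / loc_2(STATE)",
      "STATE -> Ohio / stateid('ohio')"]
d2 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of RIVER / loc_2(RIVER)",
      "RIVER -> Ohio / riverid('ohio')"]

def score(derivation):
    return math.exp(sum(weights[rule] for rule in derivation))  # exp(1.3), exp(1.05)

Z = score(d1) + score(d2)   # toy normalization over just these two derivations
print(round(score(d1) / Z, 3), round(score(d2) / Z, 3))   # ~0.562 vs ~0.438
```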
Overview of WASP
[Diagram: An unambiguous CFG of the MRL and a training set {(e, f)} of NL/MR pairs feed lexical acquisition, which produces the lexicon L; parameter estimation then yields a parsing model parameterized by λ. At test time, semantic parsing maps an input sentence e' to an output MR f'.]

Lexical Acquisition
• Transformation rules are extracted from word alignments between an NL sentence e and its correct MR f, for each training example (e, f).

Word Alignments
• Example (French–English): "Le programme a été mis en application" aligned with "And the program has been implemented".
• A word alignment is a mapping from the words of one string (here French) to the words expressing their meanings in the other (here English).

Lexical Acquisition
• Train a statistical word alignment model (IBM Model 5) on the training set.
• Obtain the most probable n-to-1 word alignments for each training example.
• Extract transformation rules from these word alignments.
• The lexicon L consists of all extracted transformation rules.

Word Alignment for Semantic Parsing
• Example: "The goalie should always stay in our half" paired with the CLang MR
  ((true) (do our {1} (pos (half our))))
• Question: how should syntactic tokens such as parentheses be handled in the alignment?

Use of MRL Grammar
• Instead of aligning NL words to raw MR tokens, align them (n-to-1) to the productions in the top-down, left-most derivation of the MR under an unambiguous CFG of the MRL:
  NL: The goalie should always stay in our half
  MRL productions:
  RULE → (CONDITION DIRECTIVE)
  CONDITION → (true)
  DIRECTIVE → (do TEAM {UNUM} ACTION)
  TEAM → our
  UNUM → 1
  ACTION → (pos REGION)
  REGION → (half TEAM)
  TEAM → our

Extracting Transformation Rules
• Transformation rules are extracted bottom-up from the alignment, replacing each production's aligned words with its non-terminal as we move up the derivation:
  – TEAM → our / our
  – REGION → TEAM half / (half TEAM)
  – ACTION → stay in REGION / (pos REGION)
  – …and so on up to the RULE production.

Probabilistic Parsing Model
• Based on a maximum-entropy model:
  Pr_λ(d | e) = (1 / Z_λ(e)) exp( Σ_i λ_i f_i(d) )
• The features f_i(d) count the number of times each transformation rule is used in the derivation d.
• The output translation is the MR yield of the most probable derivation:
  f* = m( argmax_d Pr_λ(d | e) ), where m(d) denotes the MR yield of d.

Parameter Estimation
• Maximum conditional log-likelihood criterion:
  λ* = argmax_λ Σ_(e,f) log Pr_λ(f | e)
• Since correct derivations are not annotated in the training data, the parameters λ* are learned in an unsupervised manner.
• EM algorithm combined with improved iterative scaling, where the hidden variables are the correct derivations (Riezler et al., 2000).

KRISP: Kernel-based Robust Interpretation by Semantic Parsing
• Learns a semantic parser from NL sentences paired with their respective MRs, given an MRL grammar.
• Productions of the MRL are treated as semantic concepts.
• An SVM classifier with a string subsequence kernel is trained for each production.
• These classifiers are used to compositionally build the MRs of sentences.

Kernel Functions
• A kernel K is a similarity function over a domain X that maps any two objects x, y in X to their similarity score K(x, y).
• If, for any x1, x2, …, xn in X, the n-by-n matrix (K(xi, xj))ij is symmetric and positive semidefinite, then the kernel computes the dot product of implicit feature vectors in some high-dimensional feature space.
• Machine learning algorithms that use the data only to compute similarities can be kernelized (e.g., support vector machines, nearest neighbor).

String Subsequence Kernel
• Define the kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002].
• All possible subsequences become the implicit features, and the kernel computes the dot product of these feature vectors.
• Example:
  s = "left side of our penalty area"
  t = "our left penalty area"
  Common subsequences: left, our, penalty, area, left penalty, left area, our penalty, our area, penalty area, left penalty area, our penalty area.
  K(s, t) = 11
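A minimal sketch of the subsequence count used in this example. It counts common word subsequences without the gap-decay weighting of the actual kernel of Lodhi et al. [2002], but it reproduces K(s, t) = 11 for the strings above.

```python
def count_common_subsequences(s, t):
    """Count non-empty word subsequences common to s and t (unweighted;
    the kernel of Lodhi et al. additionally applies a gap-decay factor)."""
    s, t = s.split(), t.split()
    n, m = len(s), len(t)
    # C[i][j] = number of common subsequences (including empty) of s[:i], t[:j]
    C = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i][j] = C[i - 1][j] + C[i][j - 1] - C[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                C[i][j] += C[i - 1][j - 1]
    return C[n][m] - 1   # exclude the empty subsequence

print(count_common_subsequences("left side of our penalty area",
                                "our left penalty area"))   # -> 11
```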
Normalized String Subsequence Kernel
• Normalize the kernel (so its values fall in [0, 1]) to remove any bias due to different string lengths:
  K_normalized(s, t) = K(s, t) / sqrt( K(s, s) · K(t, t) )
• Lodhi et al. [2002] give an O(n|s||t|) algorithm for computing the string subsequence kernel, where n is the subsequence length.
• The kernel has been used for text categorization [Lodhi et al., 2002] and information extraction [Bunescu & Mooney, 2005b].

Support Vector Machines
• SVMs are classifiers that learn linear separators maximizing the margin between the data and the classification boundary.
• Kernels allow SVMs to learn non-linear separators by implicitly mapping the data to a higher-dimensional feature space.
[Figure: maximum-margin linear separator with margin ρ.]

Overview of KRISP
[Diagram: Training — the MRL grammar and NL sentences with MRs are used to collect positive and negative examples and train string-kernel-based SVM classifiers; the best semantic derivations (correct and incorrect) produced by these classifiers feed back into example collection. Testing — the learned semantic parser maps novel NL sentences to their best MRs.]
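To illustrate how such per-concept classifiers might be trained, here is a sketch that plugs a normalized subsequence kernel into an SVM via a precomputed Gram matrix. KRISP's actual implementation and data differ; scikit-learn's precomputed-kernel interface, the toy positive/negative phrases, and the use of raw decision scores are all stand-ins.

```python
import math
import numpy as np
from sklearn.svm import SVC

def subseq_count(s, t):
    """Number of non-empty common word subsequences (as sketched earlier)."""
    s, t = s.split(), t.split()
    C = [[1] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            C[i][j] = C[i - 1][j] + C[i][j - 1] - C[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                C[i][j] += C[i - 1][j - 1]
    return C[-1][-1] - 1

def norm_kernel(s, t):
    """K(s,t) / sqrt(K(s,s) * K(t,t)), so values fall in [0, 1]."""
    return subseq_count(s, t) / math.sqrt(subseq_count(s, s) * subseq_count(t, t))

# Toy positives/negatives for one semantic concept (a penalty-area-like REGION).
train = ["left side of our penalty area", "our left penalty area",
         "the ball is in our half", "player 4 should pass"]
y = [1, 1, 0, 0]
gram = np.array([[norm_kernel(a, b) for b in train] for a in train])
clf = SVC(kernel="precomputed").fit(gram, y)

test = ["our penalty area"]
k_test = np.array([[norm_kernel(a, b) for b in train] for a in test])
print(clf.decision_function(k_test))   # positive score = phrase fits the concept
```

In KRISP, scores like these are converted into the per-production probabilities used during parsing.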
Overview of KRISP's Semantic Parsing
• We first define the semantic derivation of an NL sentence.
• We then define the probability of a semantic derivation.
• Semantic parsing of an NL sentence amounts to finding its most probable semantic derivation.
• It is straightforward to obtain the MR from a semantic derivation.

Semantic Derivation of an NL Sentence
• Start from the MR parse, with MRL productions on the internal nodes.
• A semantic derivation additionally assigns to each node the NL substring it covers, so each node is a pair (production, [i..j]).
• Example, for "Which rivers run through the states bordering Texas?" (words 1–9):
  (ANSWER → answer(RIVER), [1..9])
  (RIVER → TRAVERSE_2(STATE), [1..9])
  (TRAVERSE_2 → traverse_2, [1..4])
  (STATE → NEXT_TO(STATE), [5..9])
  (NEXT_TO → next_to, [5..7])
  (STATE → STATEID, [8..9])
  (STATEID → 'texas', [8..9])

Probability of a Semantic Derivation
• Let P_π(s[i..j]) be the probability that production π covers the substring s[i..j]; for example, P_{NEXT_TO → next_to}("the states bordering") for the node (NEXT_TO → next_to, [5..7]).
• These probabilities are obtained from the string-kernel-based SVM classifiers trained for each production π.
• The probability of a semantic derivation D is
  P(D) = Π_{(π, [i..j]) ∈ D} P_π(s[i..j])

Computing the Most Probable Semantic Derivation
• Implemented by extending Earley's [1970] context-free grammar parsing algorithm.
• A dynamic-programming algorithm that generates and compactly stores each subtree once.
• Performs a greedy approximate search with beam width ω and returns the ω most probable derivations it finds.

KRISP's Training Algorithm
• Takes NL sentences paired with their respective MRs as input and obtains the MR parses.
• Proceeds in iterations.
• In the first iteration, for every production π:
  – Sentences whose MR parses use π are positives.
  – The remaining sentences are negatives.
• Example, first iteration, for STATE → NEXT_TO(STATE):
  Positives:
  – which rivers run through the states bordering texas ?
  – what is the most populated state bordering oklahoma ?
  – what is the largest city in states that border california ?
  Negatives:
  – what state has the highest population ?
  – which states have cities named austin ?
  – what states does the delaware river run through ?
  – what is the lowest point of the state with the largest area ?
  These examples train the string-kernel-based SVM classifier P_{STATE → NEXT_TO(STATE)}(s[i..j]).

KRISP's Training Algorithm (continued)
• Using the classifiers P_π(s[i..j]), obtain the ω best semantic derivations of each training sentence.
• Derivations that yield the correct MR are called correct derivations; those that yield incorrect MRs are incorrect derivations.
• For the next iteration, collect positives from the most probable correct derivation, and collect negatives from incorrect derivations with higher probability than the most probable correct derivation.
• Example, next iteration, for STATE → NEXT_TO(STATE):
  Positive substrings:
  – the states bordering texas ?
  – state bordering oklahoma ?
  – states that border california ?
  – states which share border
  – next to state of iowa
  Negative substrings:
  – what state has the highest population ?
  – what states does the delaware river run through ?
  – which states have cities named austin ?
  – what is the lowest point of the state with the largest area ?
  – which rivers run through states bordering
  These examples retrain the classifier P_{STATE → NEXT_TO(STATE)}(s[i..j]).
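Tying the derivation and probability definitions above together, here is a minimal sketch that scores the example derivation as a product of per-production classifier probabilities. The classifier here is a stand-in returning a constant score; in KRISP these probabilities come from the trained string-kernel SVMs.

```python
from math import prod

sentence = "Which rivers run through the states bordering Texas ?".split()

# (production, [i..j]) nodes of the example derivation (1-based, inclusive spans).
derivation = [
    ("ANSWER -> answer(RIVER)",    (1, 9)),
    ("RIVER -> TRAVERSE_2(STATE)", (1, 9)),
    ("TRAVERSE_2 -> traverse_2",   (1, 4)),
    ("STATE -> NEXT_TO(STATE)",    (5, 9)),
    ("NEXT_TO -> next_to",         (5, 7)),
    ("STATE -> STATEID",           (8, 9)),
    ("STATEID -> 'texas'",         (8, 9)),
]

def classifier_prob(production, substring):
    """Stand-in for the string-kernel SVM probability P_pi(substring)."""
    return 0.9   # a real system returns a learned, substring-specific score

def derivation_prob(derivation):
    """P(D) = product of P_pi(s[i..j]) over the derivation's nodes."""
    return prod(classifier_prob(p, " ".join(sentence[i - 1:j]))
                for p, (i, j) in derivation)

print(derivation_prob(derivation))   # 0.9 ** 7 with the stand-in scores
```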
Experimental Corpora
• CLang
  – 300 randomly selected pieces of coaching advice from the log files of the 2003 RoboCup Coach Competition
  – 22.52 words on average per NL sentence
  – 14.24 tokens on average per formal expression
• GeoQuery [Zelle & Mooney, 1996]
  – 250 queries for the U.S. geography database
  – 6.87 words on average per NL sentence
  – 5.32 tokens on average per formal expression

Experimental Methodology
• Evaluated using standard 10-fold cross-validation.
• Correctness:
  – CLang: the output exactly matches the correct representation.
  – GeoQuery: the resulting query retrieves the same answer as the correct representation.
• Metrics:
  Precision = |Correct Completed Parses| / |Completed Parses|
  Recall = |Correct Completed Parses| / |Sentences|

[Figures: precision and recall learning curves for CLang, and precision and recall learning curves for GeoQuery.]

Future Work
• Explore methods that can automatically generate SAPTs to minimize the annotation effort for SCISSOR.
• Learn semantic parsers just from sentences paired with their "perceptual context."

Conclusions
• Learning semantic parsers is an important and challenging problem in natural-language learning.
• We have obtained promising results on several applications using a variety of approaches with different strengths and weaknesses.
• Few others have explored this problem; we encourage others to consider it.
• More and larger corpora are needed for training and testing semantic parser induction.

Thank You!
Our papers on learning semantic parsers are online at:
http://www.cs.utexas.edu/~ml/publication/lsp.html
Our corpora can be downloaded from:
http://www.cs.utexas.edu/~ml/nldata.html
Questions?