Statistical Parsing
Xinyu Dai, 2015-11

Outline
- Review of parsing and Context-Free Grammar (CFG)
- Ambiguities in natural language parsing
- Probabilistic CFG

Parsing a programming language
- Analyze a token sequence to determine its structure according to the grammar (a Context-Free Grammar).
- Why parse: report syntax errors and recover from them so processing can continue; construct a parse tree for the rest of the compiler; ...
- Methods: top-down and bottom-up.
- Ambiguities: resolved by looking ahead, ...

Parsing natural language
- Analyze a sentence to determine its structure according to the grammar (a Context-Free Grammar).
- Why parse: the parse tree represents the structure of a sentence: its phrases, and the relations between phrases and within a phrase.
- Parsing improves many NLP applications:
  - High-precision question answering (Pasca and Harabagiu, SIGIR 2001)
  - Biological named entity extraction (Finkel et al., JNLPBA 2004)
  - Syntactically based sentence compression (Lin and Wilbur, Inf. Retr. 2007)
  - Extracting people's opinions about products (Bloom et al., NAACL 2007)
- Methods: top-down or bottom-up.

Context-Free Grammar (CFG)
G = (T, N, S, R), where
- T is the set of terminals
- N is the set of non-terminals
- S is the start symbol (one of the non-terminals)
- R is the set of rules/productions of the form X → γ, where X is a non-terminal and γ is a sequence of terminals and non-terminals (possibly the empty sequence)
A grammar G generates a language L.

A phrase structure grammar
S → NP VP       N → cats
VP → V NP       N → claws
VP → V NP PP    N → people
NP → NP PP      N → scratch
NP → N          V → scratch
NP → ε          P → with
NP → N N
PP → P NP

Top-down parsing
- Top-down parsing is goal directed.
- A top-down parser starts with a list of constituents to be built. It rewrites the goals in the goal list by matching one against the LHS of a grammar rule and expanding it with the RHS, attempting to derive the sentence.
- If a goal can be rewritten in several ways, there is a choice of which rule to apply (a search problem).
- Can use depth-first or breadth-first search, and goal ordering.

Top-down parsing: worked example (figure)

Bottom-up parsing
- Bottom-up parsing is data directed.
- The initial goal list of a bottom-up parser is the string to be parsed. If a sequence in the goal list matches the RHS of a rule, that sequence may be replaced by the LHS of the rule. Parsing is finished when the goal list contains just the start category.
- If the RHS of several rules match the goal list, there is a choice of which rule to apply (a search problem).
- Can use depth-first or breadth-first search, and goal ordering.
- The standard presentation is shift-reduce parsing.

Shift-reduce parsing: one path (figure)

Problems with top-down and bottom-up parsing
- Left recursion
- Many different rules for the same LHS
- Choices that are locally possible but globally impossible
- Repeated work wherever there is common substructure
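As a concrete companion to the toy grammar above, the sketch below parses "cats scratch people with claws". It assumes the NLTK library is installed and omits the empty rule NP → ε for simplicity; a chart parser is used because the left-recursive rule NP → NP PP would send a naive top-down (recursive-descent) parser into a loop.

```python
# Minimal sketch: the toy phrase-structure grammar above in NLTK.
# Assumptions: NLTK is installed; NP -> e is left out for simplicity.
import nltk

toy_grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    VP -> V NP | V NP PP
    NP -> NP PP | N | N N
    PP -> P NP
    N  -> 'cats' | 'claws' | 'people' | 'scratch'
    V  -> 'scratch'
    P  -> 'with'
""")

# A chart parser handles the left-recursive rule NP -> NP PP without looping.
parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse("cats scratch people with claws".split()):
    print(tree)
```

The printed trees differ only in where the PP "with claws" attaches (to the object NP or to the VP), which previews the ambiguity problem discussed next.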
Repeated work (figure)

Classical NLP parsing
- Wrote a symbolic grammar and lexicon by hand.
- Used top-down or bottom-up methods to parse.
- A minimal grammar on the "Fed raises" sentence: 36 parses.
- A real-size broad-coverage grammar: millions of parses.

Ambiguities: PP attachment (figure)

Classical NLP parsing: the problem and its solution
- Very constrained grammars attempt to rule out unlikely or weird parses, but the attempt makes the grammar non-robust: many sentences have no parse at all.
- A less constrained grammar can parse more sentences, but then even simple sentences end up with ever more parses.
- Solution: we need mechanisms that allow us to find the most likely parse(s). Statistical parsing lets us work with very loose grammars that admit millions of parses for a sentence, yet still quickly find the best parse(s).

Probabilistic (stochastic) context-free grammars (PCFG)
G = (T, N, S, R, P), where
- T is the set of terminals
- N is the set of non-terminals
- S is the start symbol (one of the non-terminals)
- R is the set of rules/productions of the form X → γ, where X is a non-terminal and γ is a sequence of terminals and non-terminals (possibly the empty sequence)
- P(R) gives the probability of each rule: for all X ∈ N, Σ_{X → γ} P(X → γ) = 1
A grammar G generates a language model L: Σ_{γ ∈ T*} P(γ) = 1.

PCFG notation
- w_{1n} = w_1 ... w_n is the word sequence from 1 to n (a sentence of length n)
- w_{ab} is the subsequence w_a ... w_b
- N^j_{ab} is the non-terminal N^j dominating w_a ... w_b
- We write P(N^i → ζ^j) to mean P(N^i → ζ^j | N^i)
- We will want to calculate max_t P(t | w_{1n}), i.e., the most probable tree t that derives the sentence (t ⇒* w_{1n})

The probability of trees and strings
- P(t): the probability of a tree is the product of the probabilities of the rules used to generate it.
- P(w_{1n}): the probability of a string is the sum of the probabilities of the trees that have that string as their yield:
  P(w_{1n}) = Σ_j P(w_{1n}, t_j) = Σ_j P(t_j), where t_j is a parse of w_{1n}

A simple PCFG (grammar table, figure)

Trees (the two parse trees t1 and t2, figure)

Tree and string probabilities
w_{15} = astronomers saw stars with ears
P(t1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072
P(t2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804
P(w_{15}) = P(t1) + P(t2) = 0.0009072 + 0.0006804 = 0.0015876

Have a rest: Grammar? Probability (learning)? Decoding? ...

Grammar from a treebank (figure)

Treebank
- Going into it, building a treebank seems a lot slower and less useful than building a grammar.
- But a treebank gives us many things: reusability of the labor, broad coverage, frequencies and distributional information, and a way to evaluate systems.
- Treebank grammars can be enormous: the raw grammar has ~10K states, excluding the lexicon. Better parsers usually make the grammars larger, not smaller.

Treebank grammars: you can read a grammar right off the trees (figure)

Chomsky Normal Form
- All rules have the form X → Y Z or X → w.
- N-ary rules introduce new non-terminals.
- In practice it is kind of a pain: unaries and empties are "promoted"; reconstructing n-ary rules is easy, but reconstructing unaries is trickier; the straightforward transformations do not preserve tree scores.
- But it makes parsing algorithms simpler!

Have a rest: Grammar? Probability (learning)? Decoding? ...
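Before turning to learning and decoding, here is the "astronomers saw stars with ears" arithmetic in code. It is only a check of the numbers on the slide: the factor lists copy the per-rule products shown above (the full grammar table from the figure is not reproduced), with t1 taken as the reading that attaches the PP to the object NP and t2 as the VP attachment.

```python
# Tree probability = product of rule probabilities; string probability =
# sum over trees with that yield. Factors are copied from the slide.
from math import prod

t1_factors = [1.0, 0.1, 0.7, 1.0, 0.4, 0.18, 1.0, 1.0, 0.18]  # PP attached to NP
t2_factors = [1.0, 0.1, 0.3, 0.7, 1.0, 0.18, 1.0, 1.0, 0.18]  # PP attached to VP

p_t1 = prod(t1_factors)          # 0.0009072
p_t2 = prod(t2_factors)          # 0.0006804
print(p_t1, p_t2, p_t1 + p_t2)   # P(w15) = 0.0015876
```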
Parsing: decoding and learning
- Decoding: argmax_t P(t | w_{1n}, G)
- Learning probabilities: argmax_G P(w_{1n} | G) (maximum likelihood), where
  P(w_{1n} | G) = Σ_j P(w_{1n}, t_j | G) = Σ_j P(t_j | G), with t_j a parse of w_{1n}
- In general we cannot efficiently calculate the probability of a string by simply summing the probabilities of all possible parse trees for the sentence, as there are exponentially many of them. The inside and outside algorithms are efficient ways to do it.

Inside algorithm
- Inside variable: α_{i,j}(A) = P(w_{ij} | A, G)
- Inside probability: the probability of a particular (sub)string, given that the string is dominated by a particular non-terminal (figure: A → B C, with B dominating w_i ... w_k and C dominating w_{k+1} ... w_j).
- We only consider grammars in Chomsky Normal Form, which have only lexical rules A → w and binary rules A → B C.

Inside probability calculation
For i < j:
  α_{i,j}(A) = P(w_i ... w_j | A)
             = Σ_{B,C,k} P(w_i ... w_k, B, w_{k+1} ... w_j, C | A)
Assumptions: place invariance, context-freeness, ancestor-freeness. Then
  α_{i,j}(A) = Σ_{B,C,k} P(B, C | A) P(w_i ... w_k | A, B, C) P(w_{k+1} ... w_j | w_i ... w_k, A, B, C)
             = Σ_{B,C,k} P(B, C | A) P(w_i ... w_k | B) P(w_{k+1} ... w_j | C)
             = Σ_{B,C,k} P(A → B C) α_{i,k}(B) α_{k+1,j}(C)
For i = j:
  α_{i,i}(A) = P(A → w_i)

Finding the probability of a string with inside probabilities
Input: G and a string W = w_1 w_2 ... w_n. Output: P(w_1 w_2 ... w_n | G) = α_{1,n}(S).
- Initialization: α_{i,i}(A) = P(A → w_i), for all A ∈ N, 1 ≤ i ≤ n
- Induction: α_{i,j}(A) = Σ_{B,C ∈ N} Σ_{i ≤ k < j} P(A → B C) α_{i,k}(B) α_{k+1,j}(C)
- Termination: P(S ⇒ w_1 ... w_n | G) = α_{1,n}(S)
With inside probabilities we obtain the probability of W as the sum over all parse trees generated from S.

Outside probability
- Outside variable: β_{i,j}(A) = P(w_{1(i-1)}, A_{ij}, w_{(j+1)n} | G)
- Outside probability: the probability of generating a particular non-terminal together with the word sequences outside its domination, starting from the start non-terminal (figure: the two configurations C → A B and C → B A).

Outside probability calculation
Base case:
  β_{1,n}(A) = 1 if A = S, and 0 otherwise
Recursion:
  β_{i,j}(A) = P(w_1 ... w_{i-1}, A, w_{j+1} ... w_n | G)
             = Σ_{B,C, k > j} P(w_1 ... w_{i-1}, C, w_{k+1} ... w_n) P(C → A B) P(B ⇒ w_{j+1} ... w_k)
             + Σ_{B,C, h < i} P(w_1 ... w_{h-1}, C, w_{j+1} ... w_n) P(C → B A) P(B ⇒ w_h ... w_{i-1})
             = Σ_{B,C, k > j} β_{i,k}(C) P(C → A B) α_{j+1,k}(B)
             + Σ_{B,C, h < i} β_{h,j}(C) P(C → B A) α_{h,i-1}(B)

Finding the probability of a string with outside probabilities
For any position k:
  P(w_{1n} | G) = Σ_A P(w_1 ... w_{k-1}, A ⇒ w_k, w_{k+1} ... w_n | G) = Σ_A β_{k,k}(A) P(A → w_k)

Viterbi algorithm
- Decoding: argmax_t P(t | w_{1n}, G)
- Viterbi variable: δ_{i,j}(A) is the highest inside probability of a parse of the subtree rooted in A over w_i ... w_j.

Finding the most probable parse with the Viterbi algorithm
Input: G and a string W = w_1 w_2 ... w_n. Output: t*, the most probable parse tree of W.
- Initialization: δ_{i,i}(A) = P(A → w_i), for all A ∈ N, 1 ≤ i ≤ n
- Induction: for 1 ≤ j ≤ n−1 and 1 ≤ i ≤ n−j, calculate
  δ_{i,i+j}(A) = max_{B,C ∈ N; i ≤ k < i+j} P(A → B C) δ_{i,k}(B) δ_{k+1,i+j}(C)
  ψ_{i,i+j}(A) = argmax_{B,C ∈ N; i ≤ k < i+j} P(A → B C) δ_{i,k}(B) δ_{k+1,i+j}(C)
- Termination: P(t*) = δ_{1,n}(S), where S is the designated start symbol; t* is recovered by tracing the backpointers from ψ_{1,n}(S).

Have a rest: Grammar? Probability (learning)? Decoding? ...
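The inside computation above translates almost directly into code. The sketch below is a minimal illustration for a CNF grammar; the dictionary-based grammar representation is an assumption made for clarity, not the slides' notation.

```python
# Minimal sketch of the inside algorithm for a PCFG in Chomsky Normal Form.
# lexical: dict mapping (A, w)    -> P(A -> w)
# binary:  dict mapping (A, B, C) -> P(A -> B C)
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """Return P(words | G) = alpha_{1,n}(start)."""
    n = len(words)
    # alpha[(i, j)][A] = P(w_i ... w_j | A), 0-based inclusive spans
    alpha = defaultdict(lambda: defaultdict(float))

    # Initialization: alpha_{i,i}(A) = P(A -> w_i)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                alpha[(i, i)][A] += p

    # Induction, shortest spans first:
    # alpha_{i,j}(A) = sum_{B,C,k} P(A -> B C) * alpha_{i,k}(B) * alpha_{k+1,j}(C)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (A, B, C), p in binary.items():
                    b = alpha[(i, k)][B]
                    c = alpha[(k + 1, j)][C]
                    if b > 0.0 and c > 0.0:
                        alpha[(i, j)][A] += p * b * c

    # Termination: P(w_1 ... w_n | G) = alpha_{1,n}(start)
    return alpha[(0, n - 1)][start]
```

Replacing the sum over (B, C, k) with a max, and recording the argmax as a backpointer ψ, turns the same loop structure into the Viterbi/CKY computation of δ_{i,j}(A) described above.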
Learning probabilities
- Maximum likelihood estimation: training from a treebank, using relative frequencies.
- If we have no treebank: expectation maximization (EM), using inside and outside probabilities.

Relative frequency
From a treebank: to get the probability of a particular VP rule, just count all the times the rule is used and divide by the total number of VP expansions.
Estimated rule probability:
  P(A → α) = Count(A → α) / Count(A)
For example:
  P(VP → Vi) = Count(VP → Vi) / (Count(VP → Vi) + Count(VP → VT NP) + Count(VP → VP PP))

Example counts and the resulting PCFG:
  Rule          Count   Relative frequency
  S → NP VP     3       3 / 3 = 1.0
  NP → Pro      2       2 / 9 = 0.22
  NP → Det N    6       6 / 9 = 0.67
  NP → NP PP    1       1 / 9 = 0.11
  VP → V NP     3       3 / 4 = 0.75
  VP → VP PP    1       1 / 4 = 0.25
  PP → P NP     2       2 / 2 = 1.0

Expectation maximization with inside and outside probabilities
- Estimate rule probabilities through EM: the expected occurrence count of A → B C is the expected co-occurrence of the symbols A, B and C.
  P(A → α) = ExpectedCount(A → α) / ExpectedCount(A)
- An efficient method for estimating rule probabilities from unparsed data is the inside-outside algorithm. Like the HMM forward/backward training procedure, it is an analogous application of the Baum-Welch (EM) idea.

Rule probability with EM
  P(A → B C) = ExpectedCount(A → B C) / ExpectedCount(A)
             = [Σ_{1 ≤ i ≤ k < j ≤ n} β_{i,j}(A) P(A → B C) α_{i,k}(B) α_{k+1,j}(C)] / [Σ_{1 ≤ i ≤ j ≤ n} β_{i,j}(A) α_{i,j}(A)]

Algorithm for EM estimation
- Initialization: assign a random value to each P(A → α), ensuring Σ_α P(A → α) = 1; call this PCFG G_0.
- Repeat the following steps until convergence:
  - Calculate the expected occurrences of each rule under G_i (using inside and outside probabilities).
  - Re-estimate each rule probability from the expected occurrences, giving a new PCFG G_{i+1}.

Review: decoding
- Decoding: argmax_t P(t | w_{1n}, G)
- Viterbi variable: δ_{i,j}(A) is the highest inside probability of a parse of the subtree rooted in A over w_i ... w_j.

A recursive parser (code on slide): Will this parser work? Why or why not? What are its memory requirements?

A memoized parser (code on slide): Question: what if there are unary rules like X → Y?

A bottom-up parser, CKY (code on slide)

Memory
- How much memory does this require? We have to store the score cache.
- Cache size: |symbols| × n² doubles. For the plain treebank grammar, with ~20K symbols, n = 40, and 8-byte doubles, that is about 256 MB. Big, but workable.
- Pruning with beams: score[X][i][j] can get too large; we can keep beams (truncated maps score[i][j]) that only store the best few scores for each span [i, j].

Beam search (figure)

Time
How much time will it take to parse?
- For each span length diff (≤ n), for each start position i (≤ n), for each rule X → Y Z, for each split point k.
- Total time: |rules| × n³.
- Something like 5 seconds for an unoptimized parse of a 20-word sentence.

Evaluating parsing accuracy
- Most sentences are not given a completely correct parse by any currently existing parser.
- Standardly, for Penn Treebank parsing, evaluation is done in terms of the percentage of correct constituents (labeled spans).
- A constituent is a triple [label, start, end], all of which must be in the true parse for the constituent to be counted as correct.

Evaluating constituent accuracy
Let C be the number of correct constituents produced by the parser over the test set, M the total number of constituents the parser produced, and N the total number in the correct (gold) version.
- Precision = C / M
- Recall = C / N
Both are important measures.
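These two measures are easy to compute once constituents are represented as [label, start, end] triples. The sketch below is illustrative only; the toy triples mirror the worked example on the next slide.

```python
# Minimal sketch of labeled constituent precision and recall.
# Duplicated constituents are ignored in this sketch.
def labeled_precision_recall(parser_constituents, gold_constituents):
    correct = len(set(parser_constituents) & set(gold_constituents))  # C
    precision = correct / len(parser_constituents)                    # C / M
    recall = correct / len(gold_constituents)                         # C / N
    return precision, recall

parser_out = [("S", 1, 7), ("NP", 1, 1), ("VP", 2, 7),
              ("NP", 3, 4), ("PP", 5, 7), ("NP", 6, 7)]
gold = [("S", 1, 7), ("NP", 1, 1), ("VP", 2, 7), ("NP", 3, 7),
        ("NP", 3, 4), ("PP", 5, 7), ("NP", 6, 7)]

print(labeled_precision_recall(parser_out, gold))  # (1.0, 6/7 ≈ 0.857)
```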
Parser evaluation using the Penn Treebank
Labeled precision and recall of constituents, counting each constituent as a [label, start, end] triple:

  The parser says:        The treebank says:
  S   1 7                 S   1 7
  NP  1 1                 NP  1 1
  VP  2 7                 VP  2 7
  NP  3 4                 NP  3 7
  PP  5 7                 NP  3 4
  NP  6 7                 PP  5 7
                          NP  6 7

Review: the probability of a parse tree
- The probability of a tree is the product of the probabilities of the rules used to generate it.
- But not every NP expansion can fill every NP slot: a grammar with symbols like "NP" won't really be context-free; statistically, the conditional independence assumptions are too strong.

Non-independence
- Independence assumptions are often too strong. Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
- Also: the subject and object expansions are correlated.

Grammar refinement
The game of designing a grammar: annotation refines the base treebank symbols to improve the statistical fit of the grammar.
- Structural annotation [Johnson '98; Klein and Manning '03]
- Head lexicalization [Collins '99; Charniak '00]

Lexicalized PCFG
  p(V NP NP | VP)       = 0.00151
  p(V NP NP | VP(said)) = 0.00001
  p(V NP NP | VP(gave)) = 0.01980

Lexicalized parsing
- Central to the lexicalization model is the idea that the strong lexical dependencies are between heads and their dependents.
- E.g., in "hit the girl with long hair with a hammer" we might expect dependencies between "girl" and "hair", and between "hit" and "hammer" (and between "long" and "hair", although that is not so useful here).

Heads in CFGs
- Each rule has a head; the head is one of the items on the RHS.
- What is the head? Hard to define, but roughly: the "most important" piece of the constituent, the sub-constituent that carries the meaning of the constituent.
- The notion of headedness appears in LFG, X-bar syntax, etc.

Head example (figure)
Rules: S → NP VP, NP → Pro, VP → V NP, NP → Det N

Lexicalized PCFG: finding heads
- When we extract a PCFG from the Penn Treebank, we do not know the heads of the rules, so we use a set of heuristics (a head table) to find the head of each rule.
- Example from a head table for Penn Treebank grammars, for finding the head of a VP: if the VP contains a V, the head is the leftmost V; else, if the VP contains an MD, the head is the leftmost MD; else, if the VP contains a VP, the head is the leftmost VP; else, the head is the leftmost child. (A code sketch of this heuristic is given after the references at the end.)
- Once we know the head of each rule, we can determine the headword of each constituent, starting from the bottom: the headword of a constituent is the headword of its head.

Exploiting heads
- Condition the probability of a rule expansion on the head.
- Condition the probability of the head on the type of phrase and on the head of the "mother phrase".
- In effect, the probability of a parse is some version of Π P(head | phead) P(rule | head) over all the constituents of the parse, which we use to choose the most probable parse.

Charniak's approach to lexicalized PCFG parsing
  T = argmax_t P(t | s)
  Normal PCFG:  P(t | s) ∝ Π_{n ∈ t} P(r(n))
  Charniak:     P(t | s) ∝ Π_{n ∈ t} P(r(n) | c(n), h(n)) × P(h(n) | c(n), h(m(n)))
where r(n) is the rule used to expand node n, c(n) its category, h(n) its head word, and m(n) its mother node.

Reference
- Slides from Stanford: http://nlp.stanford.edu/courses/lsa354/
- Slides from UC Berkeley: http://www.cs.berkeley.edu/~klein/cs288/sp10/
- A useful tool: NLTK, http://nltk.org/api/nltk.parse.html
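As promised in the head-finding discussion, here is a minimal sketch of the quoted VP head rule. The function name and its treatment of Penn Treebank verb tags (VB, VBD, ...) as "a V" are assumptions for illustration only; real head tables (e.g., Collins') give ordered priority lists for every category.

```python
# Minimal sketch of the VP head rule quoted above: prefer the leftmost V,
# then the leftmost MD, then the leftmost VP, else the leftmost child.
def find_vp_head(children):
    """children: list of category labels on the RHS of a VP rule."""
    for preferred in ("V", "MD", "VP"):
        for i, cat in enumerate(children):
            # Treat PTB verb tags (VB, VBD, VBG, ...) as matching "V".
            if cat == preferred or (preferred == "V" and cat.startswith("VB")):
                return i
    return 0  # fall back to the leftmost child

print(find_vp_head(["VBD", "NP"]))   # -> 0  (the verb)
print(find_vp_head(["MD", "VP"]))    # -> 0  (the modal, e.g. "will")
print(find_vp_head(["ADVP", "VP"]))  # -> 1  (the embedded VP)
```

Propagating the headword of the chosen child upward, constituent by constituent, yields the lexicalized tree used by the head-conditioned models above.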