Probabilistic and Lexicalized Parsing

Probabilistic CFGs
• Weighted CFGs
  – Attach weights to the rules of a CFG
  – Compute weights of derivations
  – Use weights to pick preferred parses
• Utility: pruning and ordering the search space, disambiguation, language model for ASR
• Parsing with weighted grammars (like weighted FAs): T* = argmax_T W(T, S)
• Probabilistic CFGs are one form of weighted CFGs.

Probability Model
• Rule probability:
  – Attach probabilities to grammar rules
  – Expansions for a given non-terminal sum to 1
      R1: VP → V          .55
      R2: VP → V NP       .40
      R3: VP → V NP NP    .05
  – Estimate the probabilities from annotated corpora: P(R1) = count(R1) / count(VP)
• Derivation probability:
  – Derivation T = {R1 … Rn}
  – Probability of a derivation: P(T) = ∏_{i=1..n} P(Ri)
  – Most probable parse: T* = argmax_T P(T)
  – Probability of a sentence: P(S) = Σ_T P(T, S), summing over all possible derivations T of the sentence
• Note the independence assumption: a rule's probability does not change based on where in the derivation the rule is expanded.

Structural ambiguity
• S → NP VP
• VP → V NP
• NP → NP PP
• VP → VP PP
• PP → P NP
• NP → John | Mary | Denver
• V → called
• P → from

Sentence: John called Mary from Denver
[Two parse trees: in one, the PP "from Denver" attaches to the NP "Mary" (NP → NP PP); in the other, it attaches to the VP "called Mary" (VP → VP PP).]

Cocke-Younger-Kasami Parser
• Bottom-up parser with top-down filtering
• Start state(s): (A, i, i+1) for each A → w_{i+1}
• End state: (S, 0, n), where n is the input size
• Next-state rule: (B, i, k), (C, k, j) ⇒ (A, i, j) if A → B C

Example: John called Mary from Denver
• Base case (A → w): the words are tagged NP (John), V (called), NP (Mary), P (from), NP (Denver).
• Recursive case (A → B C): successive chart snapshots build PP "from Denver", VP "called Mary", NP "Mary from Denver", two VP analyses (VP1, VP2) over "called Mary from Denver", and finally two S constituents spanning the whole sentence, one for each attachment of the PP.
[The original slides show the CKY chart being filled cell by cell; the tables are not reproduced here.]

Probabilistic CKY
• Assign probabilities to constituents as they are completed and placed in the table
• Computing the probability:
      P(A, i, j) = Σ_{A → B C} P(A → B C, i, j)
      P(A → B C, i, j) = P(B, i, k) · P(C, k, j) · P(A → B C)
  – Since we are interested in the max P(S, 0, n), use the max probability for each constituent
• Maintain back-pointers to recover the parse. (A small code sketch follows the next slide.)

Problems with PCFGs
• The probability model we're using is based only on the rules in the derivation.
• Lexical insensitivity:
  – Doesn't use the words in any real way
  – Structural disambiguation is lexically driven
    • PP attachment often depends on the verb, its object, and the preposition
    • I ate pickles with a fork.
    • I ate pickles with relish.
• Context insensitivity of the derivation:
  – Doesn't take into account where in the derivation a rule is used
    • Pronouns are more often subjects than objects
    • She hates Mary.
    • Mary hates her.
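The probabilistic CKY recursion above can be made concrete with a short sketch. This is a minimal illustration, not part of the original slides: it uses the toy grammar from the structural-ambiguity slide, but the rule probabilities are invented here purely so the max-probability bookkeeping and back-pointers have something to work with.

```python
# Minimal probabilistic CKY sketch for the toy grammar from the slides.
# The rule probabilities are made up for illustration; the slides give none
# for this grammar.
from collections import defaultdict

# Lexical rules: word -> list of (non-terminal, probability)
lexicon = {
    "John":   [("NP", 0.3)],
    "Mary":   [("NP", 0.3)],
    "Denver": [("NP", 0.2)],
    "called": [("V", 1.0)],
    "from":   [("P", 1.0)],
}

# Binary rules in CNF: (B, C) -> list of (A, P(A -> B C))
binary_rules = {
    ("NP", "VP"): [("S", 1.0)],
    ("V",  "NP"): [("VP", 0.7)],
    ("VP", "PP"): [("VP", 0.3)],
    ("NP", "PP"): [("NP", 0.2)],
    ("P",  "NP"): [("PP", 1.0)],
}

def pcky(words):
    n = len(words)
    # table[(i, j)] maps a non-terminal A to (best probability, back-pointer)
    table = defaultdict(dict)

    # Base case: (A, i, i+1) for each A -> w_{i+1}
    for i, w in enumerate(words):
        for A, p in lexicon.get(w, []):
            table[(i, i + 1)][A] = (p, w)

    # Recursive case: A -> B C, keeping only the max-probability analysis
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B, (pB, _) in table[(i, k)].items():
                    for C, (pC, _) in table[(k, j)].items():
                        for A, p_rule in binary_rules.get((B, C), []):
                            p = p_rule * pB * pC
                            if p > table[(i, j)].get(A, (0.0, None))[0]:
                                table[(i, j)][A] = (p, (B, C, k))
    return table

def build_tree(table, A, i, j):
    """Follow back-pointers to recover the best parse for (A, i, j)."""
    _, bp = table[(i, j)][A]
    if isinstance(bp, str):            # lexical entry: A -> word
        return (A, bp)
    B, C, k = bp
    return (A, build_tree(table, B, i, k), build_tree(table, C, k, j))

words = "John called Mary from Denver".split()
chart = pcky(words)
prob, _ = chart[(0, len(words))]["S"]
print(prob)                            # probability of the best parse
print(build_tree(chart, "S", 0, len(words)))
```

With the probabilities chosen here, the VP-attachment analysis of "from Denver" wins; shifting probability mass from VP → VP PP to NP → NP PP flips the preference, which is exactly the ambiguity the slides illustrate.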
• Solution: Lexicalization
  – Add lexical information to each rule

An example of lexical information: Heads
• Make use of the notion of the head of a phrase
  – The head of an NP is its noun
  – The head of a VP is its main verb
  – The head of a PP is its preposition
• Each LHS of a rule in the PCFG has a lexical item
• Each RHS non-terminal has a lexical item
  – One of the lexical items is shared with the LHS
• If |R| is the number of binary branching rules in the CFG (and Σ the vocabulary), the lexicalized CFG has O(2·|Σ|·|R|) rules
• Unary rules: O(|Σ|·|R|)

[The slides "Example (correct parse)", "Attribute grammar", and "Example (less preferred)" show lexicalized parse trees for an example sentence; the tree figures are not reproduced here.]

Computing Lexicalized Rule Probabilities
• We started with rule probabilities
  – VP → V NP PP: P(rule | VP)
    • E.g., the count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities
  – VP(dumped) → V(dumped) NP(sacks) PP(in)
  – P(rule | VP, "dumped" is the verb, "sacks" is the head of the NP, "in" is the head of the PP)
  – Not likely to have significant counts in any treebank

Another Example
• Consider the VPs
  – ate spaghetti with gusto
  – ate spaghetti with marinara
• The relevant dependency is not between mother and child nodes in the tree.
[Two trees: in "ate spaghetti with gusto" the PP(with) attaches to VP(ate); in "ate spaghetti with marinara" the PP(with) attaches to NP(spaghetti).]

Log-linear models for Parsing
• Why restrict the conditioning to the elements of a rule?
  – Use an even larger context
  – Word sequence, word types, sub-tree context, etc.
• In general, compute P(y | x), where each f_i(x, y) tests a property of the context and λ_i is the weight of that feature:
      P(y | x) = exp(Σ_i λ_i f_i(x, y)) / Σ_{y'∈Y} exp(Σ_i λ_i f_i(x, y'))
• Use these as scores in the CKY algorithm to find the best-scoring parse. (A small scoring sketch appears after the summary.)

Supertagging: Almost parsing
Sentence: Poachers now control the underground trade
[The slide pairs each word with a lattice of candidate elementary trees (supertags): NP trees for "poachers" and "trade", S/VP trees for "control", adverbial trees for "now", an adjectival tree for "underground", and a determiner tree for "the". Choosing the correct supertag for each word resolves most of the ambiguity, hence "almost parsing".]

Summary
• Parsing context-free grammars
  – Top-down and bottom-up parsers
  – Mixed approaches (CKY, Earley parsers)
• Preferences over parses using probabilities
  – Parsing with PCFGs and the probabilistic CKY algorithm
• Enriching the probability model
  – Lexicalization
  – Log-linear models for parsing
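As a companion to the log-linear scoring slide above, here is a minimal sketch of the P(y | x) computation. The feature functions, weights, and attachment labels are invented for illustration; the slides do not specify any particular feature set.

```python
# Minimal sketch of log-linear (maximum-entropy) scoring: P(y|x) is a softmax
# over weighted feature counts. Features and weights are hypothetical.
import math

def log_linear_distribution(x, candidates, features, weights):
    """Return P(y|x) for each candidate y, given features f_i and weights lambda_i."""
    scores = []
    for y in candidates:
        score = sum(weights[name] * f(x, y) for name, f in features.items())
        scores.append(score)
    z = sum(math.exp(s) for s in scores)      # normalization over the candidate set Y
    return [math.exp(s) / z for s in scores]

# Hypothetical features for the PP-attachment ambiguity discussed earlier:
features = {
    # fires if the PP attaches to the verb and the preposition is "with"
    "verb_attach_with": lambda x, y: 1.0 if y == "VP-attach" and "with" in x else 0.0,
    # fires if the PP attaches to the object noun phrase
    "noun_attach":      lambda x, y: 1.0 if y == "NP-attach" else 0.0,
}
weights = {"verb_attach_with": 1.2, "noun_attach": 0.4}

x = "ate spaghetti with gusto"
candidates = ["VP-attach", "NP-attach"]
for y, p in zip(candidates, log_linear_distribution(x, candidates, features, weights)):
    print(y, round(p, 3))
```

With these made-up weights the verb attachment is preferred for "with gusto"; swapping the weights flips the preference, which is the point of letting features look at the words and the wider context rather than only at the rule being expanded.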