Probabilistic and Lexicalized Parsing
CS 4705

Probabilistic CFGs: PCFGs
• Weighted CFGs
  – Attach weights to the rules of a CFG
  – Compute weights of derivations
  – Use weights to choose preferred parses
• Utility: pruning and ordering the search space, disambiguation, language modeling for ASR
• Parsing with weighted grammars: find the parse T' that maximizes the weight of the derivation over all possible parses of S
  – T'(S) = argmax_{T ∈ τ(S)} W(T, S)
• Probabilistic CFGs are one form of weighted CFGs

Rule Probability
• Attach probabilities to grammar rules
• Expansions for a given non-terminal sum to 1
  R1: VP → V        .55
  R2: VP → V NP     .40
  R3: VP → V NP NP  .05
• Estimate probabilities from annotated corpora
  – E.g. the Penn Treebank
  – P(R1) = count(R1) / count(VP)

Derivation Probability
• For a derivation T = {R1 … Rn}:
  – Probability of the derivation: P(T) = ∏_{i=1..n} P(Ri)
    • The product of the probabilities of the rules expanded in the tree
  – Most probable parse: T* = argmax_T P(T)
  – Probability of a sentence: P(S) = Σ_{T ∈ τ(S)} P(T, S)
    • Sum over all possible derivations of the sentence
• Note the independence assumption: a rule's probability does not change based on where in the derivation it is expanded.

One Approach: CYK Parser
• Bottom-up parsing via dynamic programming
  – Assign probabilities to constituents as they are completed and placed in a table
  – Use the maximum probability for each constituent type going up the tree to S
• The intuition:
  – We already know the probabilities of constituents lower in the tree, so as we construct higher-level constituents we don't need to recompute them

CYK (Cocke-Younger-Kasami) Parser
• Bottom-up parser with top-down filtering
• Uses dynamic programming to store intermediate results (cf. the Earley algorithm for the top-down case)
• Input: PCFG in Chomsky Normal Form
  – Rules of the form A → w or A → B C; no ε
• Chart: array [i, j, A] holds the probability that non-terminal A spans input positions i through j
  – Start state(s): (i, i+1, A) for each A → w_{i+1}
  – End state: (1, n, S), where n is the input size
  – Next-state rule: (i, k, B) and (k, j, C) combine into (i, j, A) if A → B C
• Maintain back-pointers to recover the parse
• (A small code sketch of the probabilistic version follows the example below.)

Structural Ambiguity
• Grammar:
  S → NP VP     NP → NP PP    NP → John | Mary | Denver
  VP → V NP     PP → P NP     V → called
  VP → VP PP                  P → from
• Sentence: John called Mary from Denver
• Two parses: the PP "from Denver" attaches either to the VP (called ... from Denver) or to the NP (Mary from Denver)
  [Tree diagrams for the two parses omitted]

Example
• John called Mary from Denver
• Base case (A → w): each word fills a cell on the chart diagonal — NP (John), V (called), NP (Mary), P (from), NP (Denver)
• Recursive cases (A → B C): adjacent spans combine — PP (from Denver), NP (Mary from Denver), two competing VPs over "called Mary from Denver" (via VP → V NP and VP → VP PP), and finally S over the whole sentence
  [Step-by-step chart diagrams omitted]
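The code sketch referenced above: a minimal probabilistic CYK (PCKY) in Python over the toy grammar from the Structural Ambiguity slide. The rule probabilities, the 0-indexed chart, and the function name pcky are invented for illustration and are not from the lecture; a real implementation would read the grammar and its probabilities from a treebank.

# A minimal probabilistic CYK (PCKY) sketch.  Grammar and probabilities are
# toy values based on the Structural Ambiguity example, invented for illustration.
from collections import defaultdict

# Lexical rules A -> w with P(rule | A); expansions of each non-terminal sum to 1
lexical = {
    ("NP", "John"): 0.25, ("NP", "Mary"): 0.25, ("NP", "Denver"): 0.30,
    ("V", "called"): 1.0, ("P", "from"): 1.0,
}
# Binary rules A -> B C with P(rule | A)
binary = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}

def pcky(words):
    n = len(words)
    # chart[i][j][A] = (best probability, back-pointer) for A spanning words[i:j]
    chart = [[defaultdict(lambda: (0.0, None)) for _ in range(n + 1)]
             for _ in range(n + 1)]
    # Base case: A -> w_i fills the diagonal
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w and p > chart[i][i + 1][A][0]:
                chart[i][i + 1][A] = (p, w)
    # Recursive case: combine adjacent spans with A -> B C, keeping the maximum
    # probability per (span, non-terminal), plus a back-pointer to recover the tree
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    prob = p * chart[i][k][B][0] * chart[k][j][C][0]
                    if prob > chart[i][j][A][0]:
                        chart[i][j][A] = (prob, (k, B, C))
    return chart

chart = pcky("John called Mary from Denver".split())
print(chart[0][5]["S"])   # best probability for S over the whole input, with back-pointer

With these toy probabilities the higher VP → VP PP analysis ("called Mary" modified by "from Denver") wins, illustrating how the chart keeps only the maximum-probability analysis per constituent type and span.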
Problems with PCFGs
• The probability model is based only on the rules in the derivation.
• Lexical insensitivity:
  – Doesn't use words in any real way
  – But structural disambiguation is lexically driven
    • PP attachment often depends on the verb, its object, and the preposition
    • I ate pickles with a fork.
    • I ate pickles with relish.
• Context insensitivity of the derivation
  – Doesn't take into account where in the derivation a rule is used
    • Pronouns are more often subjects than objects
    • She hates Mary.
    • Mary hates her.
• Solution: Lexicalization
  – Add lexical information to each rule
  – I.e., condition the rule probabilities on the actual words

An Example: Phrasal Heads
• Phrasal heads can 'take the place of' whole phrases, defining the most important characteristics of the phrase
• Phrases are generally identified by their heads
  – The head of an NP is a noun, of a VP the main verb, of a PP the preposition
• In a lexicalized grammar, each PCFG rule's LHS shares a lexical item (its head) with one non-terminal in its RHS

Increase in Size of Rule Set in Lexicalized CFG
• If R is the number of binary branching rules in the CFG and Σ is the lexicon, the lexicalized rule set is O(2·|Σ|·|R|)
• For unary rules: O(|Σ|·|R|)

Example (correct parse) / Example (less preferred)
[Lexicalized (attribute-grammar) tree diagrams omitted]

Computing Lexicalized Rule Probabilities
• We started with rule probabilities as before
  – VP → V NP PP    P(rule | VP)
    • E.g., the count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities
  – VP(dumped) → V(dumped) NP(sacks) PP(into)
    • I.e., P(rule | VP ∧ dumped is the verb ∧ sacks is the head of the NP ∧ into is the head of the PP)
  – Not likely to have significant counts in any treebank

Exploit the Data You Have
• So, exploit the independence assumption and collect the statistics you can…
• Focus on capturing
  – Verb subcategorization
    • Particular verbs have affinities for particular VP expansions
  – Objects' affinity for their predicates
    • Mostly their mothers and grandmothers
    • Some objects fit better with some predicates than others

Verb Subcategorization
• Condition particular VP rules on their heads
  – E.g. for a rule r: VP → V NP PP
    • P(r | VP) becomes P(r | VP, dumped)
  – How do you get the probability?
    • The number of times rule r was used with dumped, divided by the total number of VPs that dumped appears in
    • I.e., how predictive of r is the verb dumped?
  – Captures the affinity between VP heads (verbs) and VP rules
  – (A small counting sketch for these estimates appears after the Hindle & Rooth slide below.)

Example (correct parse) / Example (less preferred)
[Lexicalized tree diagrams omitted]

Affinity of Phrasal Heads for Other Heads: PP Attachment
• Verbs with prepositions vs. nouns with prepositions
• E.g. dumped with into vs. sacks with into
  – How often is dumped the head of a VP that includes a PP daughter headed by into, relative to other PP heads? I.e., what is P(into | PP, dumped is the mother VP's head)?
  – Vs. how often is sacks the head of an NP with a PP daughter headed by into, relative to other PP heads? I.e., P(into | PP, sacks is the mother's head)?

But Other Relationships Do Not Involve Heads (Hindle & Rooth '91)
• The affinity of gusto for ate is greater than for spaghetti, while the affinity of marinara for spaghetti is greater than for ate
  – "Ate spaghetti with gusto": the PP with gusto attaches to VP(ate)
  – "Ate spaghetti with marinara": the PP with marinara attaches to NP(spaghetti)
  [Tree diagrams omitted]
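The counting sketch referenced in the Verb Subcategorization slide: relative-frequency estimates for P(rule | VP, head) and for the PP-attachment affinities, assuming head-annotated counts have already been extracted from a treebank. The count values, dictionary names, and helper functions below are invented for illustration and are not actual treebank figures.

# A minimal sketch of the relative-frequency estimates described above.
# All counts are made up for illustration; in practice they would be
# collected from a head-annotated treebank.
from collections import Counter

# count of (VP rule, verb head) pairs, and of each verb head over all its VPs
rule_given_head = Counter({("VP -> V NP PP", "dumped"): 67, ("VP -> V NP", "dumped"): 30})
vp_head = Counter({"dumped": 100})

# count of (PP head, mother's head) pairs, and of each mother head with a PP daughter
pp_given_mother = Counter({("into", "dumped"): 45, ("into", "sacks"): 2})
mother_with_pp = Counter({"dumped": 60, "sacks": 65})

def p_rule_given_vp_head(rule, verb):
    """Verb subcategorization: P(rule | VP, head = verb)."""
    return rule_given_head[(rule, verb)] / vp_head[verb]

def p_prep_given_mother(prep, mother):
    """PP-attachment affinity: P(prep | PP, mother's head = mother)."""
    return pp_given_mother[(prep, mother)] / mother_with_pp[mother]

print(p_rule_given_vp_head("VP -> V NP PP", "dumped"))   # 0.67
print(p_prep_given_mother("into", "dumped"))             # 0.75
print(p_prep_given_mother("into", "sacks"))              # ~0.03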
Log-linear Models for Parsing
• Why restrict the conditioning to the elements of a rule?
  – Use an even larger context: word sequence, word types, sub-tree context, etc.
• Compute P(y | x), where each feature function f_i(x, y) tests a property of the context and λ_i is the weight of feature i:
  P(y | x) = exp(Σ_i λ_i · f_i(x, y)) / Σ_{y' ∈ Y} exp(Σ_i λ_i · f_i(x, y'))
• Use these as scores in the CKY algorithm to find the best parse
• (A small numeric sketch appears at the end of these notes.)

Supertagging: Almost parsing
• Example sentence: Poachers now control the underground trade
[Figure omitted: each word — poachers, now, control, the, underground, trade — is listed with its set of candidate elementary trees (supertags); choosing the right supertag for each word does most of the work of parsing]

Summary
• Parsing context-free grammars
  – Top-down and bottom-up parsers
  – Mixed approaches (CKY, Earley parsers)
• Preferences over parses using probabilities
  – Parsing with PCFGs and the PCKY algorithm
• Enriching the probability model
  – Lexicalization
  – Log-linear models for parsing
  – Supertagging
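The numeric sketch referenced in the log-linear slide: a toy computation of P(y | x) over two candidate attachment decisions, with two hand-picked feature functions. The features, weights, and sentence are made up for illustration; a real parser would use thousands of features with weights learned from data.

# A minimal sketch of the log-linear score:
# P(y|x) = exp(sum_i li * fi(x, y)) / sum over y' in Y of exp(sum_i li * fi(x, y')).
import math

def loglinear_probs(x, candidates, features, weights):
    """Return P(y | x) for each candidate y, normalized over the candidate set Y."""
    scores = [math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in candidates]
    z = sum(scores)                      # normalization constant over Y
    return [s / z for s in scores]

# Toy features over (sentence, PP-attachment decision)
features = [
    lambda x, y: 1.0 if y == "vp-attach" else 0.0,                                     # bias toward VP attachment
    lambda x, y: 1.0 if y == "vp-attach" and "dumped" in x and "into" in x else 0.0,   # verb/preposition collocation
]
weights = [0.2, 1.5]

x = "workers dumped sacks into a bin"
candidates = ["vp-attach", "np-attach"]
for y, p in zip(candidates, loglinear_probs(x, candidates, features, weights)):
    print(y, round(p, 3))   # vp-attach ~ 0.85, np-attach ~ 0.15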