Probabilistic and Lexicalized Parsing
CS 4705

Probabilistic CFGs: PCFGs
• Weighted CFGs
  – Attach weights to the rules of a CFG
  – Compute weights of derivations
  – Use weights to choose preferred parses
• Utility: pruning and ordering the search space, disambiguation, language modeling for ASR
• Parsing with weighted grammars: find the parse T' that maximizes the weight of the derivation over all possible parses of S
  – T'(S) = argmax_{T ∈ τ(S)} W(T, S)
• Probabilistic CFGs are one form of weighted CFGs

Rule Probability
• Attach probabilities to grammar rules
• Expansions for a given non-terminal sum to 1
  R1: VP → V        .55
  R2: VP → V NP     .40
  R3: VP → V NP NP  .05
• Estimate probabilities from annotated corpora
  – E.g. the Penn Treebank
  – P(R1) = count(R1) / count(VP)

Derivation Probability
• For a derivation T = {R1 … Rn}:
  – Probability of the derivation: P(T) = ∏_{i=1..n} P(Ri)
    • The product of the probabilities of the rules expanded in the tree
  – Most probable parse: T* = argmax_T P(T)
  – Probability of a sentence: P(S) = Σ_{T ∈ τ(S)} P(T, S)
    • Sum over all possible derivations of the sentence
• Note the independence assumption: a rule's probability does not change based on where in the derivation it is expanded.

One Approach: CYK Parser
• Bottom-up parsing via dynamic programming
  – Assign probabilities to constituents as they are completed and placed in a table
  – Use the maximum probability for each constituent type going up the tree to S
• The intuition:
  – We already know the probabilities of constituents lower in the tree, so as we construct higher-level constituents we don't need to recompute them

CYK (Cocke-Younger-Kasami) Parser
• Bottom-up parser with top-down filtering
• Uses dynamic programming to store intermediate results (cf. the Earley algorithm for the top-down case)
• Input: PCFG in Chomsky Normal Form
  – Rules of the form A → w or A → B C; no ε
• Chart: array [i, j, A] holds the probability that non-terminal A spans input positions i through j
  – Start state(s): (i, i+1, A) for each A → w_{i+1}
  – End state: (1, n, S), where n is the input size
  – Next-state rule: (i, k, B) and (k, j, C) combine into (i, j, A) if A → B C
• Maintain back-pointers to recover the parse
• (A small code sketch of the probabilistic version follows the example below.)

Structural Ambiguity
• Grammar:
  S → NP VP     NP → NP PP    NP → John | Mary | Denver
  VP → V NP     PP → P NP     V → called
  VP → VP PP                  P → from
• Sentence: John called Mary from Denver
• Two parses: the PP "from Denver" attaches either to the VP (called ... from Denver) or to the NP (Mary from Denver)
  [Tree diagrams for the two parses omitted]

Example
• John called Mary from Denver
• Base case (A → w): each word fills a cell on the chart diagonal — NP (John), V (called), NP (Mary), P (from), NP (Denver)
• Recursive cases (A → B C): adjacent spans combine — PP (from Denver), NP (Mary from Denver), two competing VPs over "called Mary from Denver" (via VP → V NP and VP → VP PP), and finally S over the whole sentence
  [Step-by-step chart diagrams omitted]
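The code sketch referenced above: a minimal probabilistic CYK (PCKY) in Python over the toy grammar from the Structural Ambiguity slide. The rule probabilities, the 0-indexed chart, and the function name pcky are invented for illustration and are not from the lecture; a real implementation would read the grammar and its probabilities from a treebank.

# A minimal probabilistic CYK (PCKY) sketch.  Grammar and probabilities are
# toy values based on the Structural Ambiguity example, invented for illustration.
from collections import defaultdict

# Lexical rules A -> w with P(rule | A); expansions of each non-terminal sum to 1
lexical = {
    ("NP", "John"): 0.25, ("NP", "Mary"): 0.25, ("NP", "Denver"): 0.30,
    ("V", "called"): 1.0, ("P", "from"): 1.0,
}
# Binary rules A -> B C with P(rule | A)
binary = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}

def pcky(words):
    n = len(words)
    # chart[i][j][A] = (best probability, back-pointer) for A spanning words[i:j]
    chart = [[defaultdict(lambda: (0.0, None)) for _ in range(n + 1)]
             for _ in range(n + 1)]
    # Base case: A -> w_i fills the diagonal
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w and p > chart[i][i + 1][A][0]:
                chart[i][i + 1][A] = (p, w)
    # Recursive case: combine adjacent spans with A -> B C, keeping the maximum
    # probability per (span, non-terminal), plus a back-pointer to recover the tree
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    prob = p * chart[i][k][B][0] * chart[k][j][C][0]
                    if prob > chart[i][j][A][0]:
                        chart[i][j][A] = (prob, (k, B, C))
    return chart

chart = pcky("John called Mary from Denver".split())
print(chart[0][5]["S"])   # best probability for S over the whole input, with back-pointer

With these toy probabilities the higher VP → VP PP analysis ("called Mary" modified by "from Denver") wins, illustrating how the chart keeps only the maximum-probability analysis per constituent type and span.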
Problems with PCFGs
• The probability model is based only on the rules in the derivation.
• Lexical insensitivity:
  – Doesn't use words in any real way
  – But structural disambiguation is lexically driven
    • PP attachment often depends on the verb, its object, and the preposition
    • I ate pickles with a fork.
    • I ate pickles with relish.
• Context insensitivity of the derivation
  – Doesn't take into account where in the derivation a rule is used
    • Pronouns are more often subjects than objects
    • She hates Mary.
    • Mary hates her.
• Solution: Lexicalization
  – Add lexical information to each rule
  – I.e., condition the rule probabilities on the actual words

An Example: Phrasal Heads
• Phrasal heads can 'take the place of' whole phrases, defining the most important characteristics of the phrase
• Phrases are generally identified by their heads
  – The head of an NP is a noun, of a VP the main verb, of a PP the preposition
• In a lexicalized grammar, each PCFG rule's LHS shares a lexical item (its head) with one non-terminal in its RHS

Increase in Size of Rule Set in Lexicalized CFG
• If R is the number of binary branching rules in the CFG and Σ is the lexicon, the lexicalized rule set is O(2·|Σ|·|R|)
• For unary rules: O(|Σ|·|R|)

Example (correct parse) / Example (less preferred)
[Lexicalized (attribute-grammar) tree diagrams omitted]

Computing Lexicalized Rule Probabilities
• We started with rule probabilities as before
  – VP → V NP PP    P(rule | VP)
    • E.g., the count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities
  – VP(dumped) → V(dumped) NP(sacks) PP(into)
    • I.e., P(rule | VP ∧ dumped is the verb ∧ sacks is the head of the NP ∧ into is the head of the PP)
  – Not likely to have significant counts in any treebank

Exploit the Data You Have
• So, exploit the independence assumption and collect the statistics you can…
• Focus on capturing
  – Verb subcategorization
    • Particular verbs have affinities for particular VP expansions
  – Objects' affinity for their predicates
    • Mostly their mothers and grandmothers
    • Some objects fit better with some predicates than others

Verb Subcategorization
• Condition particular VP rules on their heads
  – E.g. for a rule r: VP → V NP PP
    • P(r | VP) becomes P(r | VP, dumped)
  – How do you get the probability?
    • The number of times rule r was used with dumped, divided by the total number of VPs that dumped appears in
    • I.e., how predictive of r is the verb dumped?
  – Captures the affinity between VP heads (verbs) and VP rules
  – (A small counting sketch for these estimates appears after the Hindle & Rooth slide below.)

Example (correct parse) / Example (less preferred)
[Lexicalized tree diagrams omitted]

Affinity of Phrasal Heads for Other Heads: PP Attachment
• Verbs with prepositions vs. nouns with prepositions
• E.g. dumped with into vs. sacks with into
  – How often is dumped the head of a VP that includes a PP daughter headed by into, relative to other PP heads? I.e., what is P(into | PP, dumped is the mother VP's head)?
  – Vs. how often is sacks the head of an NP with a PP daughter headed by into, relative to other PP heads? I.e., P(into | PP, sacks is the mother's head)?

But Other Relationships Do Not Involve Heads (Hindle & Rooth '91)
• The affinity of gusto for ate is greater than for spaghetti, while the affinity of marinara for spaghetti is greater than for ate
  – "Ate spaghetti with gusto": the PP with gusto attaches to VP(ate)
  – "Ate spaghetti with marinara": the PP with marinara attaches to NP(spaghetti)
  [Tree diagrams omitted]
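The counting sketch referenced in the Verb Subcategorization slide: relative-frequency estimates for P(rule | VP, head) and for the PP-attachment affinities, assuming head-annotated counts have already been extracted from a treebank. The count values, dictionary names, and helper functions below are invented for illustration and are not actual treebank figures.

# A minimal sketch of the relative-frequency estimates described above.
# All counts are made up for illustration; in practice they would be
# collected from a head-annotated treebank.
from collections import Counter

# count of (VP rule, verb head) pairs, and of each verb head over all its VPs
rule_given_head = Counter({("VP -> V NP PP", "dumped"): 67, ("VP -> V NP", "dumped"): 30})
vp_head = Counter({"dumped": 100})

# count of (PP head, mother's head) pairs, and of each mother head with a PP daughter
pp_given_mother = Counter({("into", "dumped"): 45, ("into", "sacks"): 2})
mother_with_pp = Counter({"dumped": 60, "sacks": 65})

def p_rule_given_vp_head(rule, verb):
    """Verb subcategorization: P(rule | VP, head = verb)."""
    return rule_given_head[(rule, verb)] / vp_head[verb]

def p_prep_given_mother(prep, mother):
    """PP-attachment affinity: P(prep | PP, mother's head = mother)."""
    return pp_given_mother[(prep, mother)] / mother_with_pp[mother]

print(p_rule_given_vp_head("VP -> V NP PP", "dumped"))   # 0.67
print(p_prep_given_mother("into", "dumped"))             # 0.75
print(p_prep_given_mother("into", "sacks"))              # ~0.03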
Log-linear Models for Parsing
• Why restrict the conditioning to the elements of a rule?
  – Use an even larger context: word sequence, word types, sub-tree context, etc.
• Compute P(y | x), where each feature function f_i(x, y) tests a property of the context and λ_i is the weight of feature i:
  P(y | x) = exp(Σ_i λ_i · f_i(x, y)) / Σ_{y' ∈ Y} exp(Σ_i λ_i · f_i(x, y'))
• Use these as scores in the CKY algorithm to find the best parse
• (A small numeric sketch appears at the end of these notes.)

Supertagging: Almost parsing
• Example sentence: Poachers now control the underground trade
[Figure omitted: each word — poachers, now, control, the, underground, trade — is listed with its set of candidate elementary trees (supertags); choosing the right supertag for each word does most of the work of parsing]

Summary
• Parsing context-free grammars
  – Top-down and bottom-up parsers
  – Mixed approaches (CKY, Earley parsers)
• Preferences over parses using probabilities
  – Parsing with PCFGs and the PCKY algorithm
• Enriching the probability model
  – Lexicalization
  – Log-linear models for parsing
  – Supertagging
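The numeric sketch referenced in the log-linear slide: a toy computation of P(y | x) over two candidate attachment decisions, with two hand-picked feature functions. The features, weights, and sentence are made up for illustration; a real parser would use thousands of features with weights learned from data.

# A minimal sketch of the log-linear score:
# P(y|x) = exp(sum_i li * fi(x, y)) / sum over y' in Y of exp(sum_i li * fi(x, y')).
import math

def loglinear_probs(x, candidates, features, weights):
    """Return P(y | x) for each candidate y, normalized over the candidate set Y."""
    scores = [math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in candidates]
    z = sum(scores)                      # normalization constant over Y
    return [s / z for s in scores]

# Toy features over (sentence, PP-attachment decision)
features = [
    lambda x, y: 1.0 if y == "vp-attach" else 0.0,                                     # bias toward VP attachment
    lambda x, y: 1.0 if y == "vp-attach" and "dumped" in x and "into" in x else 0.0,   # verb/preposition collocation
]
weights = [0.2, 1.5]

x = "workers dumped sacks into a bin"
candidates = ["vp-attach", "np-attach"]
for y, p in zip(candidates, loglinear_probs(x, candidates, features, weights)):
    print(y, round(p, 3))   # vp-attach ~ 0.85, np-attach ~ 0.15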