Corpora and Statistical Methods
Lecture 11
Albert Gatt

Part 1: Probabilistic Context-Free Grammars and beyond

Context-free grammars: reminder

Many NLP parsing applications rely on the CFG formalism.

Definition: a CFG is a 4-tuple (N, Σ, P, S):
- N = a set of non-terminal symbols (e.g. NP, VP)
- Σ = a set of terminals (e.g. words); N and Σ are disjoint
- P = a set of productions of the form A → β, where A ∈ N and β ∈ (N ∪ Σ)* (any string of terminals and non-terminals)
- S = a designated start symbol (usually, "sentence")

CFG example:
S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a
…

Probabilistic CFGs

A PCFG is a CFG in which each production has an associated probability. Formally, a PCFG is a 5-tuple (N, Σ, P, S, D), where D: P → [0,1] is a function assigning each rule in P a probability. Usually, the probabilities are obtained from a corpus; the most widely used corpus is the Penn Treebank.

The Penn Treebank
- English sentences annotated with syntax trees
- built at the University of Pennsylvania
- 40,000 sentences, about a million words
- text from the Wall Street Journal
Other treebanks exist for other languages (e.g. NEGRA for German).

Example tree: building a tree, rule by rule (for "Mr Vinken is chairman of Elsevier"):
S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
VP → VBZ NP
VBZ → is
NP → NP PP
NP → NN
NN → chairman
PP → IN NP
IN → of
NP → NNP
NNP → Elsevier
…

Characteristics of PCFGs

In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β, e.g. the likelihood that S → NP VP (as opposed to S → VP, or S → NP VP PP, or …). It can be interpreted as a conditional probability: the probability of the expansion, given the LHS non-terminal:

P(A → β) = P(A → β | A)

Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1. If this is the case, we say the PCFG is consistent.

Uses of probabilities in parsing

Disambiguation: given n legal parses of a string, which is the most likely?
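A PCFG fragment like the one above can be represented directly in code, and the consistency condition checked mechanically. A minimal Python sketch (the rule set and probabilities are invented for illustration, not estimated from a treebank):

```python
from collections import defaultdict

# A toy PCFG as (LHS, RHS, probability) triples; values are illustrative.
RULES = [
    ("S",  ("NP", "VP"),        0.8),
    ("S",  ("Aux", "NP", "VP"), 0.2),
    ("NP", ("Det", "Nom"),      0.6),
    ("NP", ("Proper-Noun",),    0.4),
]

def is_consistent(rules, tol=1e-9):
    """Check that for every non-terminal A, the probabilities of
    all rules A -> beta sum to 1 (the consistency condition)."""
    totals = defaultdict(float)
    for lhs, _, prob in rules:
        totals[lhs] += prob
    return all(abs(total - 1.0) < tol for total in totals.values())

print(is_consistent(RULES))  # True
```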
For example, PP-attachment ambiguity can be resolved this way.

Speed: parsing is a search problem. We search through the space of possible applicable derivations, and the search space can be pruned by focusing on the most likely sub-parses of a parse.

The parser can also be used as a model to determine the probability of a sentence, given a parse. This is the typical use in speech recognition, where an input utterance can be "heard" as several possible sentences.

Using PCFG probabilities

A PCFG assigns a probability to every parse tree t of a string W, i.e. to every possible parse (derivation) of a sentence recognised by the grammar.

Notation:
- G = a PCFG
- s = a sentence
- t = a particular tree under our grammar; t consists of several nodes n, and each node is generated by applying some rule r

Probability of a tree vs. a sentence

P(t, s) = P(t) = Π_{n ∈ t} p(r(n))

i.e. simply the product of the probabilities of every rule (node) that gives rise to t (that is, the derivation of t). This is both the joint probability of t and s, and the probability of t alone. Why? P(t, s) = P(t) P(s | t), but P(s | t) must be 1, since the tree t is a parse of all the words of s:

P(t, s) = P(t) × 1 = P(t)

Picking the best parse in a PCFG

A sentence will usually have several parses; we usually want them ranked, or only want the n best parses. We therefore need to focus on P(t | s, G), the probability of a parse given our sentence and our grammar. The best parse for s is defined as:

t̂ = argmax_t P(t | s, G)

Problem: t can have multiple derivations (e.g. expand left-corner nodes first, expand right-corner nodes first, etc.), so P(t | s, G) should be estimated by summing over all possible derivations. Fortunately, derivation order makes no difference to the final probabilities, so we can assume a "canonical derivation" d of t:

P(t) =def P(d)

Probability of a sentence

Simply the sum of the probabilities of all parses of that sentence, since s is only a sentence if it is recognised by G, i.e.
if there is some tree t for s under G:

P(s) = Σ_{t: yield(t) = s} P(s, t) = Σ_{t: yield(t) = s} p(t)

i.e. we sum over all those trees whose "yield" is s.

Flaws I: structural independence

The probability of a rule r expanding a node n depends only on n; it is independent of the other non-terminals in the tree. Example: P(NP → Pro) is independent of where the NP is in the sentence, but we know that NP → Pro is much more likely in subject position. Francis et al. (1999), using the Switchboard corpus, found that 91% of subjects are pronouns, while only 34% of objects are pronouns.

Flaws II: lexical independence

Vanilla PCFGs ignore lexical material: e.g. P(VP → V NP PP) is independent of the head of the NP or the PP, or of the lexical head V. Examples:
- prepositional-phrase attachment preferences depend on lexical items; cf.
  dump [sacks into a bin]
  dump [sacks] [into a bin] (preferred parse)
- coordination ambiguity:
  [dogs in houses] and [cats]
  [dogs] [in houses and cats]

Weakening the independence assumptions in PCFGs

Lexicalised PCFGs attempt to weaken the lexical independence assumption. The most common technique is to mark each phrasal head (N, V, etc.) with its lexical material, based on the idea that the most crucial lexical dependencies are those between a head and its dependents (e.g. Charniak 1997, Collins 1999).

Lexicalised PCFGs: "Matt walks"

Lexicalisation makes probabilities partly dependent on lexical content: P(VP → VBD | VP) becomes P(VP → VBD | VP, h(VP) = walks). NB: normally, we can't assume that all heads of a phrase of category C are equally probable.

S(walks) → NP(Matt) VP(walks)
NP(Matt) → NNP(Matt)
NNP(Matt) → Matt
VP(walks) → VBD(walks)
VBD(walks) → walks
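Pulling the earlier definitions together: P(t) is simply the product of the probabilities of the rules applied at each node, and the best of several candidate parses is the one with the highest P(t). A minimal Python sketch over an invented toy grammar (the numbers are placeholders, not treebank estimates):

```python
# Toy (unlexicalised) PCFG; probabilities per LHS sum to 1.
RULE_PROB = {
    ("S",   ("NP", "VP")):  1.0,
    ("NP",  ("Det", "N")):  0.7,
    ("NP",  ("NNP",)):      0.3,
    ("VP",  ("VBD",)):      0.4,
    ("VP",  ("VBD", "NP")): 0.6,
    ("NNP", ("Matt",)):     1.0,
    ("VBD", ("walks",)):    1.0,
}

def tree_prob(tree):
    """P(t) = product of p(r(n)) over every node n in t.
    Trees are nested tuples (label, child, ...); leaves are strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

# "Matt walks" under the toy grammar:
t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBD", "walks")))
print(tree_prob(t))  # 1.0 * 0.3 * 1.0 * 0.4 * 1.0, i.e. about 0.12

# Disambiguation: rank candidate parses by P(t).
candidates = [t]  # in practice, all parses the grammar assigns to s
best = max(candidates, key=tree_prob)
```

In practice a probabilistic chart parser (e.g. probabilistic CKY) finds the best tree without enumerating every candidate.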
Practical problems for lexicalised PCFGs
- data sparseness: we don't necessarily see all heads of all phrasal categories often enough in the training data
- flawed assumptions: lexical dependencies occur elsewhere too, not just between head and complement. E.g. in "I got the easier problem of the two to solve", "of the two" and "to solve" become more likely because of the pre-head modifier "easier".

Structural context

The simple way is to calculate P(t | s, G) based on the rules in the canonical derivation d of t. This assumes that P(t) is independent of the derivation. We could condition on more structural context, but then we could lose the notion of a canonical derivation, i.e. P(t) could really depend on the derivation!

Structural context: probability of a derivation history

How do we calculate P(t) based on a derivation d? Observation:

P(d) = P(S ⇒r1 α1 ⇒r2 α2 ⇒r3 … ⇒rm s)

(the probability that a sequence of m rewrite rules in a derivation yields s). We can use the chain rule for multiplication:

P(d) = Π_{i=1..m} P(r_i | r_1, …, r_{i−1})

Approach 2: parent annotation

Annotate each node with its parent in the parse tree: e.g. if an NP has parent S, rename it NP^S. This can partly account for dependencies such as subject-of, since NP^S is a subject while NP^VP is an object. For "Matt walks":

S → NP^S VP^S
NP^S → NNP^NP
NNP^NP → Matt
VP^S → VBD^VP
VBD^VP → walks

The main point

The many different parsing approaches differ on what they condition their probabilities on.

Other grammar formalisms: phrase structure vs. dependency grammar
Dependency grammar

PCFGs are in the tradition of phrase-structure grammars. Dependency grammar instead describes syntax in terms of dependencies between words: there are no non-terminals or phrasal nodes, only lexical nodes with links between them. The links are labelled, with labels drawn from a finite list.

Example ("I gave him my address"):
<ROOT> -main-> gave
gave -subj-> I
gave -dat-> him
gave -obj-> address
address -attr-> my

Dependency grammar is now often used in probabilistic parsing. Advantages:
- dependencies directly encode lexical dependencies; therefore, disambiguation decisions take lexical material into account directly
- dependencies are a way of decomposing phrase-structure rules and their probability estimates; estimating the probability of a dependency between two words is less likely to lead to data-sparseness problems

Summary

We've taken a tour of PCFGs:
- the crucial notion: what the probability of a rule is conditioned on
- flaws in PCFGs: the independence assumptions
- several proposals to go beyond these flaws
- dependency grammars as an alternative formalism
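As a closing illustration, a labelled dependency analysis like that of "I gave him my address" can be encoded directly as head-label-dependent triples (this encoding is a sketch, not any particular parser's format):

```python
# Token 0 is an artificial <ROOT>; indices keep repeated word forms distinct.
tokens = ["<ROOT>", "I", "gave", "him", "my", "address"]

# (head index, label, dependent index) triples for the example above.
deps = [
    (0, "main", 2),   # <ROOT> -main-> gave
    (2, "subj", 1),   # gave -subj-> I
    (2, "dat",  3),   # gave -dat-> him
    (2, "obj",  5),   # gave -obj-> address
    (5, "attr", 4),   # address -attr-> my
]

def single_headed(n_tokens, deps):
    """Every token except <ROOT> must have exactly one head,
    a necessary condition for a well-formed dependency tree."""
    heads = {}
    for head, _, dep in deps:
        if dep in heads:     # a token with two heads
            return False
        heads[dep] = head
    return set(heads) == set(range(1, n_tokens))

print(single_headed(len(tokens), deps))  # True
```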