Corpora and Statistical Methods
Lecture 11
Albert Gatt
Part 1
Probabilistic Context-Free Grammars and beyond
Context-free grammars: reminder
Many NLP parsing applications rely on the CFG formalism
Definition:
CFG is a 4-tuple: (N,Σ,P,S):
N = a set of non-terminal symbols (e.g. NP, VP)
Σ = a set of terminals (e.g. words)
N and Σ are disjoint
P = a set of productions of the form A → β
A ∈ N
β ∈ (N ∪ Σ)* (any string of terminals and non-terminals)
S = a designated start symbol (usually, “sentence”)
CFG Example
S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a
…
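The 4-tuple definition can be made concrete in code. The following is a minimal sketch in Python of the example fragment above; the data layout (tuples for rules) is an illustrative choice, not a standard API.

```python
# A minimal sketch of the CFG 4-tuple (N, Sigma, P, S) in Python,
# mirroring the example fragment above.
N = {"S", "NP", "VP", "Aux", "Det", "Nom", "Proper-Noun"}  # non-terminals
Sigma = {"that", "the", "a"}                               # terminals (words)
S = "S"                                                    # start symbol

# P: productions A -> beta, where beta is a string over (N | Sigma)*
P = [
    ("S",  ("NP", "VP")),
    ("S",  ("Aux", "NP", "VP")),
    ("NP", ("Det", "Nom")),
    ("NP", ("Proper-Noun",)),
    ("Det", ("that",)), ("Det", ("the",)), ("Det", ("a",)),
]

# The definition requires N and Sigma to be disjoint, every LHS to be
# in N, and every RHS symbol to be in N | Sigma:
assert N.isdisjoint(Sigma)
assert all(lhs in N for lhs, _ in P)
assert all(sym in N | Sigma for _, rhs in P for sym in rhs)
```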
Probabilistic CFGs
A CFG where each production has an associated probability
PCFG is a 5-tuple: (N,Σ,P,S, D):
D: P → [0, 1], a function assigning each rule in P a probability
usually, probabilities are obtained from a corpus
most widely used corpus is the Penn Treebank
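Obtaining the probabilities from a corpus typically means relative-frequency (maximum-likelihood) estimation: P(A → β) = count(A → β) / count(A). A sketch, with invented counts standing in for treebank statistics:

```python
from collections import Counter

# Maximum-likelihood rule probabilities from (invented) treebank counts:
# P(A -> beta) = count(A -> beta) / count(A)
rule_counts = Counter({
    ("S",  ("NP", "VP")):        800,
    ("S",  ("Aux", "NP", "VP")): 200,
    ("NP", ("Det", "Nom")):      600,
    ("NP", ("Proper-Noun",)):    400,
})

# count(A) = total count of rules expanding A
lhs_counts = Counter()
for (lhs, rhs), n in rule_counts.items():
    lhs_counts[lhs] += n

rule_prob = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
```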
The Penn Treebank
English sentences annotated with syntax trees
built at the University of Pennsylvania
40,000 sentences, about a million words
text from the Wall Street Journal
Other treebanks exist for other languages (e.g. NEGRA for
German)
Example tree
Building a tree: rules
Tree for “Mr Vinken is chairman of Elsevier”:
[S [NP [NNP Mr] [NNP Vinken]]
   [VP [VBZ is]
       [NP [NP [NN chairman]]
           [PP [IN of] [NP [NNP Elsevier]]]]]]
Rules applied:
S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
…
Characteristics of PCFGs
In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β.
e.g. the likelihood that S → NP VP
(as opposed to S → VP, or S → NP VP PP, or…)
can be interpreted as a conditional probability:
probability of the expansion, given the LHS non-terminal
P(A → β) = P(A → β | A)
Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1
If this is the case, we say the PCFG is consistent
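The consistency condition is easy to check mechanically. A sketch, with rules stored as (LHS, RHS, probability) triples and invented probabilities:

```python
from collections import defaultdict

# Check PCFG consistency: for every non-terminal A, the probabilities
# of all rules A -> beta must sum to 1.  Probabilities are invented.
rules = [
    ("S",  ("NP", "VP"),        0.8),
    ("S",  ("Aux", "NP", "VP"), 0.2),
    ("NP", ("Det", "Nom"),      0.6),
    ("NP", ("Proper-Noun",),    0.4),
]

def is_consistent(rules, tol=1e-9):
    # Sum rule probabilities per LHS non-terminal
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    # Every non-terminal's expansions must sum to 1 (up to float tolerance)
    return all(abs(total - 1.0) < tol for total in totals.values())
```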
Uses of probabilities in parsing
Disambiguation: given n legal parses of a string, which is the most likely?
e.g. PP-attachment ambiguity can be resolved this way
Speed: parsing is a search problem
search through space of possible applicable derivations
search space can be pruned by focusing on the most likely sub-parses of a
parse
Parser can be used as a model to determine the probability of a sentence,
given a parse
typical use in speech recognition, where input utterance can be “heard” as
several possible sentences
Using PCFG probabilities
PCFG assigns a probability to every parse-tree t of a string W
e.g. every possible parse (derivation) of a sentence recognised by the
grammar
Notation:
G = a PCFG
s = a sentence
t = a particular tree under our grammar
t consists of several nodes n
each node is generated by applying some rule r
Probability of a tree vs. a sentence
P(t, s) = ∏_{n ∈ t} p(r(n)) = P(t)
simply the multiplication of the probability of every rule (node) that gives rise to t (i.e. the derivation of t)
this is both the joint probability of t and s, and the probability of t alone
why?
P(t, s) = P(t) P(s | t)
But P(s | t) must be 1, since the tree t is a parse of all the words of s
P(t, s) = P(t) × 1 = P(t)
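The product over rules can be computed by a simple recursion over the tree. A sketch, with trees as nested (label, children…) tuples, string leaves, and invented rule probabilities:

```python
# P(t, s) = P(t) = product of the probabilities of the rule applied at
# every internal node of t.  Rule probabilities are invented.
rule_prob = {
    ("S",   ("NP", "VP")):   1.0,
    ("NP",  ("NNP", "NNP")): 0.3,
    ("VP",  ("VBZ",)):       0.2,
    ("NNP", ("Mr",)):        0.01,
    ("NNP", ("Vinken",)):    0.005,
    ("VBZ", ("is",)):        0.1,
}

def tree_prob(tree):
    """Multiply the probability of the rule used at each node of the tree."""
    label, *children = tree
    # The RHS of this node's rule: child labels (or the word, at a leaf)
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for c in children:
        if isinstance(c, tuple):       # recurse into non-leaf children
            p *= tree_prob(c)
    return p

t = ("S",
     ("NP", ("NNP", "Mr"), ("NNP", "Vinken")),
     ("VP", ("VBZ", "is")))
```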
Picking the best parse in a PCFG
A sentence will usually have several parses
we usually want them ranked, or only want the n-best parses
we need to focus on P(t|s,G)
probability of a parse, given our sentence and our grammar
definition of the best parse for s:
t̂ = argmax_t P(t | s, G)
Picking the best parse in a PCFG
Problem: t can have multiple derivations
e.g. expand left-corner nodes first, or right-corner nodes first, etc.
so P(t|s,G) should be estimated by summing over all possible
derivations
Fortunately, derivation order makes no difference to the final
probabilities.
can assume a “canonical derivation” d of t
P(t) =def P(d)
Probability of a sentence
Simply the sum of probabilities of all parses of that sentence
since s is only a sentence if it’s recognised by G, i.e. if there is some t
for s under G
P(s) = ∑_{t : yield(t) = s} P(s, t) = ∑_{t : yield(t) = s} p(t)
where {t : yield(t) = s} is the set of all those trees which “yield” s
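In code, once each candidate parse has been scored, the sentence probability is just a filtered sum over yields. A trivial sketch with invented trees and numbers (the two attachments echo the dump sacks example discussed later):

```python
# P(s) = sum of P(t) over all parses t whose yield is s.
# Each candidate parse is stored as (yield, precomputed probability);
# the parses and numbers below are invented for illustration.
candidates = [
    ("dump sacks into a bin", 6.0e-7),  # V [NP] [PP] attachment
    ("dump sacks into a bin", 2.0e-7),  # V [NP PP] attachment
    ("dump sacks",            1.0e-4),  # a parse of a different sentence
]

def sentence_prob(s, candidates):
    """Sum the probabilities of all parses whose yield is exactly s."""
    return sum(p for y, p in candidates if y == s)

def best_parse_prob(s, candidates):
    """The argmax used for disambiguation: the single most likely parse."""
    return max(p for y, p in candidates if y == s)
```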
Flaws I: Structural independence
Probability of a rule r expanding node n depends only on n.
Independent of other non-terminals
Example:
P(NP → Pro) is independent of where the NP is in the sentence
but we know that NP → Pro is much more likely in subject position
Francis et al. (1999), using the Switchboard corpus:
91% of subjects are pronouns;
only 34% of objects are pronouns
Flaws II: lexical independence
vanilla PCFGs ignore lexical material
e.g. P(VP → V NP PP) is independent of the head of the NP or PP, or of the lexical head V
Examples:
prepositional phrase attachment preferences depend on lexical items;
cf:
dump [sacks into a bin]
dump [sacks] [into a bin] (preferred parse)
coordination ambiguity:
[dogs in houses] and [cats]
[dogs] [in houses and cats]
Weakening the independence assumptions in PCFGs
Lexicalised PCFGs
Attempt to weaken the lexical independence assumption.
Most common technique:
mark each phrasal head (N,V, etc) with the lexical material
this is based on the idea that the most crucial lexical
dependencies are between head and dependent
E.g.: Charniak 1997, Collins 1999
Lexicalised PCFGs: Matt walks
Makes probabilities partly dependent on lexical content.
P(VP → VBZ | VP) becomes:
P(VP → VBZ | VP, h(VP) = walks)
NB: normally, we can’t assume that all heads of a phrase of category C are equally probable.
[S(walks) [NP(Matt) [NNP(Matt) Matt]] [VP(walks) [VBZ(walks) walks]]]
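Lexicalisation amounts to propagating a head word bottom-up from a designated head child at each node. A sketch; the head-child table below is a toy stand-in for real head-finding rules (e.g. those of Collins 1999):

```python
# Annotate each phrasal node with its head word, propagated bottom-up
# from a designated head child.  The table is a toy stand-in for real
# head-finding rules.
head_child = {"S": "VP", "NP": "NNP", "VP": "VBZ"}

def lexicalise(tree):
    """tree = (label, children...) with string leaves.
    Returns ((label, head_word), lexicalised_children...)."""
    label, *children = tree
    if isinstance(children[0], str):
        # Pre-terminal: the head word is the word itself
        return ((label, children[0]), children[0])
    lex_children = [lexicalise(c) for c in children]
    # Take the head word from the designated head child
    wanted = head_child[label]
    head = next(c[0][1] for c in lex_children if c[0][0] == wanted)
    return ((label, head), *lex_children)

t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBZ", "walks")))
```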
Practical problems for lexicalised PCFGs
data sparseness: we don’t necessarily see all heads of all
phrasal categories often enough in the training data
flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement
e.g. I got [the easier problem of the two to solve]
here, of the two and to solve become more likely because of the pre-head modifier easier
Structural context
The simple way: calculate P(t|s, G) based on the rules in the canonical derivation d of t
this assumes that P(t) is independent of the derivation
could condition on more structural context
but then we could lose the notion of a canonical derivation, i.e.
P(t) could really depend on the derivation!
Structural context: probability of a derivation history
How to calculate P(t) based on a derivation d?
Observation:
P(d) = P(S ⇒_{r1} α1 ⇒_{r2} α2 ⇒ … ⇒_{rm} αm = s)
(probability that a sequence of m rewrite rules in a derivation yields s)
can use the chain rule for multiplication:
P(d) = ∏_{i=1}^{m} P(r_i | r_1, …, r_{i−1})
Approach 2: parent annotation
Annotate each node with its parent in the parse tree.
E.g. if NP has parent S, then rename NP to NP^S
Can partly account for dependencies such as subject-of (NP^S is a subject, NP^VP is an object)
[S [NP^S [NNP^NP Matt]] [VP^S [VBZ^VP walks]]]
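The annotation itself is a simple tree transform. A sketch, with trees as nested (label, children…) tuples and string leaves:

```python
# Parent annotation: relabel every non-root internal node "X" as
# "X^Parent", where Parent is the (unannotated) label of its parent.
def parent_annotate(tree, parent=None):
    label, *children = tree
    new_label = f"{label}^{parent}" if parent is not None else label
    if isinstance(children[0], str):
        # Pre-terminal: annotate the label, keep the word unchanged
        return (new_label, children[0])
    # Children see this node's ORIGINAL label as their parent
    return (new_label, *(parent_annotate(c, label) for c in children))

t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBZ", "walks")))
```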
The main point
Many different parsing approaches differ on what they
condition their probabilities on
Other grammar formalisms
Phrase structure vs. Dependency grammar
PCFGs are in the tradition of phrase-structure grammars
Dependency grammar describes syntax in terms of
dependencies between words
no non-terminals or phrasal nodes
only lexical nodes with links between them
links are labelled, labels from a finite list
Dependency Grammar
<ROOT>
main
GAVE
obj:
subj:
dat:
I
him
address
attr:
MY
Dependency grammar
Often used now in probabilistic parsing
Advantages:
directly encode lexical dependencies
therefore, disambiguation decisions take lexical material into account
directly
dependencies are a way of decomposing PSRs and their probability
estimates
estimating probability of dependencies between 2 words is less likely
to lead to data sparseness problems
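The decomposition into word-to-word links also makes the probability model very direct: one can estimate P(label, dependent | head) by relative frequency over (head, label, dependent) triples. A sketch using the example sentence above, with invented counts:

```python
from collections import Counter

# Relative-frequency estimates over dependency triples
# (head, label, dependent); counts are invented for illustration.
deps = Counter({
    ("gave",    "subj", "I"):       40,
    ("gave",    "obj",  "address"): 25,
    ("gave",    "dat",  "him"):     30,
    ("address", "attr", "my"):      10,
})

# Total count of dependents for each head word
head_totals = Counter()
for (head, label, dep), n in deps.items():
    head_totals[head] += n

def dep_prob(head, label, dep):
    """P(label, dependent | head) by relative frequency."""
    return deps[(head, label, dep)] / head_totals[head]
```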
Summary
We’ve taken a tour of PCFGs
crucial notion: what the probability of a rule is conditioned on
flaws in PCFGs: independence assumptions
several proposals to go beyond these flaws
dependency grammars are an alternative formalism